ABSTRACT


COMBINATORIAL  MARKOV  RANDOM  FIELDS 
AND  THEIR  APPLICATIONS  TO 
INFORMATION  ORGANIZATION 

FEBRUARY  2008 

RON  BEKKERMAN 

B.Sc., TECHNION - ISRAEL INSTITUTE OF TECHNOLOGY
M.Sc., TECHNION - ISRAEL INSTITUTE OF TECHNOLOGY
Ph.D.,  UNIVERSITY  OF  MASSACHUSETTS  AMHERST 

Directed  by:  Professor  James  Allan 

We propose a new type of undirected graphical model called a Combinatorial
Markov Random Field (Comraf) and discuss its advantages over existing graphical
models. We develop an efficient inference methodology for Comrafs based on
combinatorial optimization of information-theoretic objective functions; both global and
local optimization schemes are discussed. We apply Comrafs to multi-modal clustering
tasks: standard (unsupervised) clustering, semi-supervised clustering, interactive
clustering, and one-class clustering. For the one-class clustering task, we analytically
show that the proposed optimization method is optimal under certain simplifying
assumptions. We empirically demonstrate the power of Comraf models by comparing
them to other state-of-the-art machine learning techniques, in both the text clustering
and image clustering domains. For unsupervised clustering, we show that Comrafs
consistently and significantly outperform three previous state-of-the-art clustering
techniques on six real-world textual datasets. For semi-supervised clustering, we
show that the Comraf model is superior to a well-known constrained optimization
method. For interactive clustering, Comraf obtains higher accuracy than a Support
Vector Machine trained on a large amount of labeled data. For one-class clustering,
Comrafs demonstrate superior performance over two previously proposed methods.
We summarize our thesis by giving a comprehensive recipe for machine learning
modeling with Comrafs.

CHAPTER 1


INTRODUCTION 


Graphical  models  have  proven  themselves  to  be  a  useful  tool  in  machine  learning, 
showing  excellent  results  in  information  retrieval  [81],  natural  language  processing 
[93],  computer  vision  [45],  and  a  variety  of  other  fields  [57].  A  striking  benefit  of 
using  graphical  models  is  the  availability  of  black-box  inference  algorithms;  once  a 
model  is  designed,  it  is  usually  straightforward  to  apply  an  existing  optimization 
procedure  to  make  inferences  in  the  model.  Nonetheless,  existing  graphical  models 
have  certain  limitations,  both  within  supervised  and  unsupervised  frameworks  (that 
is,  when  training  data  is  available  or  unavailable,  respectively). 

Supervised learning problems are usually solved using either generative graphical
models (e.g., Bayesian networks [87]) or discriminative graphical models (such as
conditional random fields [66]). While the goal of inference in generative models is to
estimate model parameters represented jointly with the data, the goal of inference in
discriminative graphical models is to estimate model parameters given the data, in
a conditional manner. The major problem of the discriminative approach is that
a large amount of labeled data is required to construct a useful model. If the
amount of available labeled data is not enough for training a model, the model often
overfits, i.e., it performs well on data similar to the training data but shows significantly
worse results on “unexpected” data instances. Unfortunately, it is usually impossible
to decide whether the amount of available training data is enough for constructing an
effective model. Also, a supervised model can perform poorly if trained on low-quality,
noisy data.

Unsupervised learning tasks are often performed using generative graphical models
(discriminative models are inapplicable to these tasks). The structure of a generative
model describes a hypothetical procedure according to which the data was presumably
generated. To design a generative model, practitioners traditionally make
assumptions about the model’s structure, based on domain knowledge, the need for
computational tractability, or both. Such assumptions may be inappropriate and
thus introduce undesired bias into the model. Another potentially problematic issue
is that modern generative models consist of thousands or even millions of nodes;
such models are difficult to fit, analyze and learn from data (model learning can easily
become infeasible if no significant restrictions on the class of models are made).

Since both generative and discriminative graphical models have significant drawbacks,
other types of graphical models are emerging, and this has become an active
topic in machine learning. Recently, McCallum et al. [79] proposed a model that
combines  generative  and  discriminative  training.  LeCun  and  Huang  [69]  proposed 
energy-based  models  which  allow  optimization  of  non-normalized  objective  functions 
factorized  over  a  graphical  model.  However,  both  models  are  proposed  only  within 
the  framework  of  supervised  learning. 

In  this  thesis,  we  develop  a  new  type  of  graphical  model  that  has  the  following 
characteristics: 

• Unsupervised or semi-unsupervised flavor. The model is not overly dependent
on the quantity and quality of training data, but rather is applicable
to  the  cases  when  no  or  little  labeled  data  is  available.  Even  if  the  amount  of 
labeled  data  is  sufficiently  large,  the  model  does  not  assume  the  data’s  purity, 
but  takes  advantage  of  this  data  by  maximizing  agreement  of  unlabeled  and 
labeled  data  in  a  semi-unsupervised  setup. 

• Minimal bias; minimal prior knowledge to be incorporated. The number
of assumptions made on the model structure is as small as possible. In
particular, no generative assumptions are made, which minimizes the risk of
making  assumptions  that  are  misleading  or  unnecessary. 

•  Compactness,  ability  for  model  learning  and  comprehensive  analysis 
of  the  model  behavior.  Graphical  models  with  millions  of  nodes  are  difficult 
to  comprehend  and  analyze.  We  take  into  account  the  fact  that  learning  the 
model  structure  can  be  optimal  only  for  small  models  [27]. 

To  meet  the  criteria  above,  we  construct  a  graphical  model  which  is  intrinsically 
different  from  existing  graphical  models.  The  most  important  difference  is  that  in 
our  model,  a  certain  portion  of  the  model  complexity  is  transferred  from  its  graph 
topology into its nodes, so that the resulting model consists of a small number of
“rich” nodes. It turns out that such a model is straightforwardly applicable to
multi-modal learning problems.

Multi-modal learning is a learning framework for environments in which multiple
views (or modalities) of the data are available. For example, in the text domain,
a  set  of  documents  is  one  modality  of  the  data,  while  a  set  of  words  within  those 
documents  is  another  modality.  In  fact,  most  real-world  datasets  are  multi-modal. 
Multi-modality  of  the  data  can  be  observed  in  a  variety  of  research  fields,  such  as: 

•  Text  processing:  documents,  words,  authors,  titles,  part-of-speech  tags; 

•  Image  processing:  images,  colors,  texture,  blobs,  interest  points,  caption 
words; 

•  Video  processing:  video  signal,  audio  signal,  frames,  subtitles,  transcripts; 

•  Bioinformatics:  patients,  tissues,  samples,  genes,  proteins,  compounds; 

• Web information retrieval: Web pages, words, hyperlinks, markup primitives;

• Data mining: movies, actors, directors, production companies;

and many others.

Three  decades  ago  McGurk  and  MacDonald  published  their  pioneering  work  [80] 
that  revealed  the  multi-modal  nature  of  speech  perception:  sound  and  moving  lips 
compose  one  system,  so  to  better  process  audio  signals,  an  audio/video  interaction 
should  be  modeled.  Since  then,  machine  learning  researchers  have  widely  exploited 
data multi-modality, using a variety of approaches, such as multi-modal neural
networks [32], multivariate information bottleneck [46], and multi-view expectation
maximization [21].

We propose a graphical model for multi-modal learning in which exactly one node is
assigned to each modality, while edges represent statistical interactions between the
modalities. Since such interactions are symmetric, the resulting model is undirected,
i.e., it adopts the Markov Random Field (MRF) formalism. All the applications that
we consider in this thesis are of a multi-modal nature; however, in our future
work, we will explore other types of possible applications.

The  model  we  propose  has  the  desired  characteristics  listed  above: 

• Multi-modality discloses the high-level structure of data, being therefore a cheap
and easily available form of supervision. Indeed, while obtaining labeled examples
is expensive, deciding which data views are relevant to a particular task
at hand is usually straightforward. Taking advantage of this additional, structural
knowledge allows us to successfully solve unsupervised and semi-supervised
problems.

• The only domain knowledge incorporated into the model is the availability/usability
of multiple modalities and their interaction patterns. No assumptions
about prior distributions, latent variables and data point-wise interactions are
made, which minimizes the model’s bias.

• Meaningful models can consist of just a handful of nodes, allowing easy analysis.
For example, the problem of choosing the most influential interactions between
nodes can be straightforwardly solved by testing a number of potentially good
combinations (or even all possible combinations, in the case of models with only
a few nodes).

In this thesis we explore a range of multi-modal clustering tasks, including one-class
clustering. We consider only discrete tasks over finite datasets, and we note that
those tasks have a combinatorial nature: given a dataset of n instances, the standard
(hard) clustering is the problem of partitioning these instances into k groups, whereas
one-class clustering is the problem of selecting k instances; both are well-known
combinatorial problems. Therefore, we represent these learning tasks as combinatorial
optimization. In multi-modal cases, our goal is to simultaneously solve multiple
combinatorial optimization problems, one for each data modality.

To  summarize,  the  contributions  of  this  thesis  are  as  follows: 

1. We propose a new type of graphical model called a Combinatorial Markov Random
Field (Comraf) that has beneficial properties (as discussed above): it models
a high-level structure of the data, represented as a handful of “rich” nodes
(that correspond to data modalities) and interactions between them. The inner
structure of Comraf’s nodes is apparent and therefore does not require an explicit
graphical representation in the model, which results in a light and elegant
layout.

2. We show that Comrafs are a natural modeling framework for multi-modal problems,
able to obtain excellent results on real-world tasks. For each task, a particular
objective function is designed that best fits the task. Therefore, Comrafs
are  more  flexible  than  most  graphical  models  which  are  mainly  limited  to  using 
maximum likelihood (ML) or maximum a posteriori (MAP) objectives.

3. We apply Comrafs to unsupervised, semi-supervised, interactive and one-class
clustering tasks. We represent each task as a combinatorial optimization problem
of a multi-modal nature. To our knowledge, most of the proposed tasks
are novel: we are not aware of previous work on semi-supervised multi-modal
clustering, interactive multi-modal clustering, or one-class multi-modal clustering.

4.  We  design  information-theoretic  objective  functions  for  our  models.  In  the  case 
of  one-class  clustering,  we  show  that  optimizing  our  objective  function  leads  to 
an  optimal  solution,  under  some  simplifying  assumptions.  Also,  in  the  case  of 
multi-modal  clustering,  we  show  that  incorporating  our  objective  function  into 
the  Comraf  model  nicely  generalizes  previous  successful  clustering  models. 

5. We propose combinatorial optimization methods for solving our learning problems,
for each of which we design efficient combinatorial algorithms and analyze
their computational complexity.

6.  Overall,  we  present  a  formal  framework  for  multi-modal  learning  that  brings 
together  two  research  areas:  graphical  models  and  combinatorial  optimization. 

The  rest  of  this  thesis  is  organized  as  follows:  in  Chapter  2  we  provide  some 
necessary  background;  in  Chapter  3  we  describe  the  Comraf  model;  after  which  we 
discuss  each  Comraf  application  in  turn:  (unsupervised)  clustering  in  Chapter  4, 
semi-supervised  and  interactive  clustering  in  Chapter  5,  and  one-class  clustering  in 
Chapter  6.  In  Chapter  7,  we  summarize  previous  chapters  by  exploring  a  variety  of 
Comraf  modeling  possibilities  on  an  example  of  image  clustering.  In  Chapter  8,  we 
conclude  and  discuss  advanced  Comraf  problems  that  are  not  described  in  depth  in 
this thesis, such as multi-modal ranking.

CHAPTER 2


PRELIMINARIES 


In  this  chapter,  we  first  provide  background  information  on  graphical  models, 
and  in  particular  on  Markov  Random  Fields.  We  then  present  three  major  machine 
learning  paradigms:  supervised,  semi-supervised,  and  unsupervised  learning.  Finally, 
we  concentrate  on  data  clustering — the  most  important  application  of  unsupervised 
learning — for  which  we  give  some  necessary  definitions  and  insights. 

2.1  Markov  Random  Fields 

A graphical model is a tuple $(G, P)$, where $G$ is a graph whose nodes correspond to
random variables $X = \{X_1, \ldots, X_m\}$ and whose edges $E$ denote interactions between
these variables; $P$ is a joint probability distribution defined over $X$. Let us use the
short notation $P(x) = P(X_1 = x_1, \ldots, X_m = x_m)$, where each $x_i$ is a value from the
domain of $X_i$.

Definition  2.1.1  A  graphical  model  (G,  P)  is  called  a  Markov  Random  Field  (MRF) 
if  the  following  two  conditions  hold: 

• (Positivity) $\forall x : P(x) > 0$

• (Markovianity) Let $\mathbf{X}_1$, $\mathbf{X}_2$ and $\mathbf{X}_3$ be three disjoint subsets of random variables
in $X$. We have that

$$P(\mathbf{X}_1, \mathbf{X}_2 \mid \mathbf{X}_3) = P(\mathbf{X}_1 \mid \mathbf{X}_3) \, P(\mathbf{X}_2 \mid \mathbf{X}_3)$$

(i.e., $\mathbf{X}_1$ and $\mathbf{X}_2$ are conditionally independent given $\mathbf{X}_3$) iff every path between
a node from $\mathbf{X}_1$ and a node from $\mathbf{X}_2$ contains a node from $\mathbf{X}_3$.

The Markov blanket of a node $X_i$ is defined as the set of all the immediate neighbors
of $X_i$ in $G$. The Markovianity can then be restated as having each variable $X_i$
conditionally independent of the rest of the model, given its Markov blanket. Note
that dependencies between nodes in an MRF are symmetric; those dependencies are
represented in $G$ by undirected edges. Consequently, an MRF is often referred to as an
undirected graphical model.

An important observation about an MRF is that the joint distribution $P$ is given
but (in most cases) not fully observed. The goal of an inference procedure in an
MRF is then to answer questions about the distribution $P$, such as what is the most
likely assignment $x^* = \arg\max_x P(x)$ to variables $\{X_1, \ldots, X_m\}$ (this task is called
the Most Probable Explanation, or MPE; see, e.g., [75]). Naturally, answering most
such questions is NP-hard since it potentially requires considering every possible
assignment. Thus, most inference techniques fall into the category of approximation
methods.

Definition  2.1.2  A  distribution  P  is  called  a  Gibbs  distribution  if  it  can  be  written 
in the form

$$P(x) = \frac{1}{Z} \exp \sum_{C} f_C(x_C),$$

where

• $C$ is a clique in the graph $G$;

• $f_C$ is a real-valued function defined over values of random variables from $C$;

• $Z$ is a normalization factor.

We refer to functions $f_C$ as log-potential functions (this term reflects the fact that their
exponents are traditionally referred to as potential functions). The normalization
factor $Z$ is called a partition function.
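
The definition above can be made concrete with a small sketch. The toy model below is purely illustrative and is not taken from the thesis: the chain structure, the binary domains and the agreement-rewarding log-potential are all assumptions chosen so that every assignment can be enumerated by brute force.

```python
import itertools
import math

# Hypothetical toy MRF: three binary variables on a chain X1 - X2 - X3.
# The cliques are the two edges; the log-potential below is made up.
CLIQUES = [(0, 1), (1, 2)]

def log_potential(a, b):
    """Illustrative log-potential: rewards neighboring variables that agree."""
    return 1.0 if a == b else 0.0

def unnormalized_score(x):
    """Sum of log-potentials over all cliques for the assignment x."""
    return sum(log_potential(x[i], x[j]) for i, j in CLIQUES)

def gibbs_distribution(num_vars=3, domain=(0, 1)):
    """Enumerate every assignment and normalize by the partition function Z."""
    assignments = list(itertools.product(domain, repeat=num_vars))
    weights = [math.exp(unnormalized_score(x)) for x in assignments]
    z = sum(weights)  # partition function: a sum over all configurations
    return {x: w / z for x, w in zip(assignments, weights)}

P = gibbs_distribution()
mpe = max(P, key=P.get)  # brute-force Most Probable Explanation
```

Enumeration is feasible here only because the model has 2^3 assignments; the same loop over realistic domains is exactly the exponential cost that motivates approximate inference.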


Figure  2.1.  An  example  of  a  Markov  Random  Field. 

First proven by Julian Besag [19], the Hammersley-Clifford theorem states the following.

Theorem 2.1.3 The tuple $(G, P)$ is an MRF if and only if $P$ is a Gibbs distribution.

Note that log-potential functions can be defined on cliques of any size; however,
smaller cliques are usually preferred from the computational point of view. For example,
consider the MRF in Figure 2.1, where $X_1$, $X_2$, $X_3$, $X_4$ are multinomial random
variables, each with 10,000 possible values. We can consider two cliques of size 3,
i.e., $\mathbf{X}_1 = \{X_1, X_2, X_3\}$ and $\mathbf{X}_2 = \{X_2, X_3, X_4\}$, and then the joint distribution $P(x)$
can be factorized over those cliques as:

$$P(x) = \frac{1}{Z} \exp f_1(\mathbf{X}_1) \, \exp f_2(\mathbf{X}_2),$$

such that each log-potential function $f_i$ will have to have $10^{12}$ values. Inference in a
model like that can be infeasible in practice. We can also consider five cliques of size 2
(i.e., the edges $\{X_1, X_2\}$, $\{X_1, X_3\}$, $\{X_2, X_3\}$, $\{X_2, X_4\}$, and $\{X_3, X_4\}$), and factorize
the joint distribution accordingly. In this case, the log-potentials $f_i$ will have to have
only $10^8$ values each, which is, in many cases, feasible.
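
The arithmetic behind these table sizes can be checked with a few lines. This is only a sketch: the names `table_entries`, `size3_cliques` and `size2_cliques` are introduced here for illustration, and a log-potential is assumed to be stored as a dense table with one entry per joint value of its clique.

```python
DOMAIN_SIZE = 10_000  # values per variable, as in the example above

def table_entries(clique):
    """Number of entries in a dense log-potential table over one clique."""
    return DOMAIN_SIZE ** len(clique)

size3_cliques = [("X1", "X2", "X3"), ("X2", "X3", "X4")]
size2_cliques = [("X1", "X2"), ("X1", "X3"), ("X2", "X3"),
                 ("X2", "X4"), ("X3", "X4")]

per_table_3 = table_entries(size3_cliques[0])            # 10**12 entries
per_table_2 = table_entries(size2_cliques[0])            # 10**8 entries
total_3 = sum(table_entries(c) for c in size3_cliques)   # 2 * 10**12
total_2 = sum(table_entries(c) for c in size2_cliques)   # 5 * 10**8
```

Even with more tables, the pairwise factorization needs several thousand times fewer entries in total than the factorization over cliques of size 3.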

2.2  Three  major  learning  paradigms 

In this thesis, we employ MRFs for solving unsupervised and semi-supervised
learning problems. When possible, we compare our results with those of supervised
learning methods, such as a Support Vector Machine (SVM) [109]. The most widely
studied type of supervised learning problem is classification: a model is trained on
(a large number of) data instances, each of which was a priori associated with one
or more target classes (we say that it was labeled). The model is then applied to
associate other, unlabeled data instances with the target classes. Obviously, it is
burdensome to collect and label the training data.

In  contrast,  in  unsupervised  learning,  the  model  is  built  to  fit  the  data  as  it  is, 
where  no  labeled  instances  are  necessary.  Data  clustering  is  the  main  example  of 
unsupervised  learning  problems.  There  are  two  versions  of  data  clustering:  hard  and 
soft. In hard clustering, we partition the set of data instances into groups (clusters)
such  that  these  groups  are  as  homogeneous  as  possible  (according  to  a  given  criterion). 
Soft  clustering  is  applied  when  data  instances  can  belong  to  more  than  one  cluster: 
each  data  instance  is  associated  with  all  the  clusters  according  to  a  certain  probability 
distribution.  In  this  thesis,  we  will  consider  only  the  hard  clustering  task,  leaving  the 
soft  clustering  case  for  future  work. 

Unsupervised learning problems are usually solved in graphical models using the
maximum likelihood (ML) framework (see, e.g., [22]), where model parameters that
best explain the data are estimated. Most ML methods deal with approximating the
partition function $Z_f$, which is generally a hard task, because $Z_f$ depends on the
particular choice of the log-potentials $f_i$ and is a sum over all the possible
configurations. In this thesis, we will apply the
MPE framework instead, for reasons that will become clear later.

Semi-supervised learning is usually viewed from two different perspectives: (a) as
training a supervised model while taking advantage of available unlabeled instances;
(b) as building an unsupervised model that takes advantage of some labeled data,
whose amount is not enough to train a supervised model. In this thesis, we focus on
the latter type, which is often called semi-supervised clustering [116].

2.3 Clustering

Most existing data clustering algorithms belong to one of two categories: hierarchical
(top-down or bottom-up) and flat. A flat algorithm starts with data instances
distributed over k clusters (where k is the desired number of clusters) and reorganizes
or updates the clusters until convergence. A top-down hierarchical algorithm
is initialized with one cluster containing all data instances, which is then iteratively
split into portions until the desired number of clusters k is achieved. A bottom-up
hierarchical algorithm starts with singleton clusters (one data instance per cluster)
and merges clusters iteratively until, again, k is reached. An obvious drawback of
flat algorithms as compared to hierarchical ones is that flat procedures are
often heavily dependent on their initialization: most of them perform poorly when
initialized at random. Many heuristics have been proposed that come up with meaningful
initial clusters (see, e.g., [40]); however, most of these heuristics are domain-specific.
Therefore, in this thesis we concentrate on hierarchical clustering schemes,
although we occasionally mention flat methods as well (see, e.g., Section 4.4).
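
The bottom-up scheme described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the method developed in this thesis: the data are one-dimensional numbers, and the merge criterion (closest centroids) is a placeholder chosen only to keep the example short.

```python
def bottom_up_clustering(points, k):
    """Merge singleton clusters until only k clusters remain."""
    clusters = [[p] for p in points]  # one data instance per cluster

    def centroid(cluster):
        return sum(cluster) / len(cluster)

    while len(clusters) > k:
        # Find the pair of clusters whose centroids are closest
        # (an assumed, illustrative merge criterion).
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = abs(centroid(clusters[i]) - centroid(clusters[j]))
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]  # merge cluster j into i
        del clusters[j]
    return [sorted(c) for c in clusters]

result = bottom_up_clustering([1.0, 1.2, 5.0, 5.1, 9.0], k=3)
```

No initial partition is needed: the procedure always starts from singletons, which is the property that makes hierarchical schemes insensitive to initialization.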

CHAPTER 3


COMBINATORIAL  MARKOV  RANDOM  FIELDS 


In this chapter, we first introduce the notion of a combinatorial random variable,
then propose Combinatorial Markov Random Fields (Comrafs), and develop an
inference  technique  for  Comrafs  based  on  combinatorial  optimization. 

3.1  Combinatorial  random  variables 

Definition 3.1.1 A combinatorial random variable (or combinatorial r.v.) $X^c$ is a
discrete random variable defined over a combinatorial set.

A combinatorial set in mathematical parlance means a set of all subsets, partitionings,
permutations etc. of a given finite set. To capture this intuition, we define
a finite set $A$ as combinatorial if its size is exponential with respect to another finite
set $B$, i.e., $\log |A| = O(|B|)$. As an example, a combinatorial r.v. $X^c$ can be defined
over all the outcomes of lotto 6 of 49, in which 6 balls are selected from 49 enumerated
balls to produce an outcome of the lottery. In this case, set $B$ consists of 49
balls, while set $A$ consists of $\binom{49}{6}$ possible choices of 6 balls from $B$. In a fair lottery,
the distribution of $X^c$ is uniform: each outcome is drawn with probability $1/\binom{49}{6}$.
However, in an unfair lottery, some outcomes are more probable than others.
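
To give a sense of scale for the lottery example, the size of the combinatorial set can be computed directly. The variable names below are introduced only for this illustration.

```python
import math

num_balls, num_drawn = 49, 6

# Size of the combinatorial set A: all choices of 6 balls out of 49.
num_outcomes = math.comb(num_balls, num_drawn)

# Probability of each individual outcome in a fair lottery.
p_fair = 1 / num_outcomes
```

The event space already contains roughly 14 million values for a base set of only 49 elements, which is why distributions over combinatorial sets are rarely written out as tables.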

It  is  easy  to  come  up  with  other  examples  of  combinatorial  r.v.’s:  over  all  the 
possible  translations  of  a  sentence,  over  all  the  possible  orderings  in  a  ranked  list 
of  retrieved  documents,  etc.  In  Chapter  4  we  consider  combinatorial  r.v.’s  over  all 
partitionings  of  a  given  set;  in  Chapter  6  we  consider  combinatorial  r.v.’s  over  all 
subsets of a given set.

From the theoretical perspective, a combinatorial r.v. behaves exactly as an ordinary
discrete random variable with a finite domain. However, from the practical
point of view, a combinatorial r.v. is different: in most real-world cases, the event
space of $X^c$ is so large that the distribution $P(X^c)$ cannot be explicitly specified.
Moreover, the Most Probable Explanation (MPE) task (see Chapter 2) for combinatorial
r.v.’s can be computationally hard. Consider an unfair lottery in which
the distribution of $X^c$ is flat (close to uniform): say, the probability of the value
$\{7, 23, 29, 35, 48, 49\}$ is 0 and the probability of the value $\{4, 18, 28, 37, 39, 43\}$ is $2/\binom{49}{6}$,
while the rest of the values still have probability $1/\binom{49}{6}$. An exponentially long
sampling process is required to detect the most probable value.

3.2  Combinatorial  Markov  Random  Fields 

Definition 3.2.1 A combinatorial Markov random field (Comraf) is a Markov Random
Field, at least one node of which is a combinatorial random variable.

In  this  thesis,  we  will  consider  only  Comraf  models,  every  node  of  which  is  a 
combinatorial  r.v.  As  in  any  other  MRF,  random  variables  in  Comraf  models  can  be 
in  either  a  hidden  or  observed  state.  A  combinatorial  r.v.  is  hidden  if  it  can  take  any 
value  from  its  event  space.  A  combinatorial  r.v.  is  observed  if  its  value  is  preset  and 
fixed.  Chapter  4  presents  Comraf  models  with  only  hidden  variables.  In  Chapters  5 
and  7  we  introduce  observed  random  variables  to  Comraf  models. 

An edge $e_{ij} = (X_i^c, X_j^c)$ in a Comraf graph corresponds to a statistical interaction
between combinatorial r.v.’s $X_i^c$ and $X_j^c$. The presence or absence of the edge $e_{ij}$
indicates whether $X_i^c$ and $X_j^c$ are in a tight statistical interaction or not. For example,
consider three nodes in a Comraf graph for an email collection, one of which ($M^c$)
corresponds to the modality of email messages, another ($A^c$) to the authors of the
messages, and the third one ($S^c$) to the subject lines. Obviously, email messages are
in statistical interactions both with their authors and their subjects. However, it is
not obvious whether the authors’ modality interacts with the subject lines.
Indeed, the subject line in the first message of an email thread was given by its sender,
while all the other messages in this thread are often “forced” to use the same subject
line, possibly given by another sender. Therefore, it might be natural to have edges
$(M^c, A^c)$ and $(M^c, S^c)$ in the Comraf graph, and to drop the $(A^c, S^c)$ edge.

One might argue that, while the $(A^c, S^c)$ interaction is not clearly present, it
might still exist in the data. As in any graphical model, there is a tradeoff between
the Comraf’s adequacy and the computational complexity of its inference procedure.
The larger the Comraf model is, the more difficult the inference would be. Thus, it
is the practitioner’s responsibility to decide which edges will be present in the model
and which will be absent. Let us emphasize again that, since Comraf models are
usually compact, a model learning procedure can be used to automatically infer the
optimal set of the model’s edges. Keeping the model learning option in mind, we leave it
for our future research. Also note that, as in any other MRF, the lack of statistical
interaction between variables $X_i^c$ and $X_j^c$ (and therefore the absence of the
corresponding edge in a Comraf graph) implies conditional independence of $X_i^c$ and
$X_j^c$ given the rest of the model.

As discussed above in Section 3.1, even simple inference tasks (trivially performed
on ordinary random variables) are computationally hard for combinatorial random
variables. Since every combinatorial r.v. carries a large portion of a Comraf’s complexity,
even small Comraf models (of just a few nodes) remain non-trivial. Inference in
Comrafs is thus viewed from a different perspective than inference in other graphical
models. Usually, an inference procedure consists of traversing the graph $G$
and performing computations at the graph’s nodes. In most graphical models, where
nodes are ordinary random variables, the computation step is simple, while traversing
the (large) graph is a resource-demanding process. In these cases, it is very important
to keep track of numerous intermediate computations. Simplicity and homogeneity
of such a process play a crucial role in those models. For example, it is impractical to
optimize different objective functions in different regions of the graph $G$. These
considerations dramatically restrict practitioners in their choice of an objective function
for their models. Most graphical models optimize the maximum likelihood objective
(see Chapter 2). However, the situation is different for Comrafs. In Comrafs,
computations performed in each node are the most intensive part of the inference
process. Traversing the graph, however, is relatively inexpensive, as the number of
nodes is small in comparison to other models. Thus, a practically unrestricted variety
of objective functions can be considered, both probabilistic and non-probabilistic,
homogeneous and heterogeneous in various regions of $G$.

Let us now show that optimizing an arbitrary objective function over $G$ can be
represented in terms of an MPE inference in a Comraf. As discussed in Chapter 2,
the joint distribution of random variables in an MRF is factored over the graph $G$ as:

$$P(x) = \frac{1}{Z_f} \exp \sum_i f_i(x_{C_i}),$$

where the log-potential functions $f_i$ are arbitrary functions defined over cliques $C_i$ in $G$.
If we fix the log-potentials $f_i$ for each clique, the partition function $Z_f$ becomes a
constant. Since the exponential function is monotonically increasing, in the MPE
inference we can thus directly optimize a non-normalized linear
combination of the log-potential functions:

$$x^* = \arg\max_x P(x) = \arg\max_x \sum_i f_i(x_{C_i}),$$

which now solely depends on the choice of the log-potentials.

3.3  Algorithmic  aspects  of  inference  in  Comrafs 

As we have mentioned in Section 3.1, in most cases it is infeasible to explicitly
specify the distribution $P(X^c)$, i.e. to represent it as a probability table in which
each value is assigned a certain probability mass. In such situations, estimating the
joint distribution of all the Comraf nodes becomes even less feasible. A somewhat
traditional approach to dealing with the problem of combinatorial explosion is to
transfer the probabilistic setup to the continuous space. However, it is well known
(see, e.g., [86]) that such a transformation may cause significant approximation
errors. Another alternative is to apply a local search in the event space L of
$X^c$. Yet another possibility is to apply more sophisticated combinatorial optimization
methods, such as Branch and Bound [67].^1 In this thesis, we choose the local search
approach. Let us proceed with definitions.

Definition  3.3.1  A  transaction  is  an  elementary  operation  in  traversing  the  event 
space  L  of  a  combinatorial  r.v.  Xc. 

In other words, a transaction is a jump operation between neighboring points
in the event space L (i.e., closest values of $X^c$). For each particular learning task,
the event space of a combinatorial r.v. will be defined differently, and so will
a transaction. For now, let us assume that we know how to move from one value of $X^c$
to another.

Definition 3.3.2 A path in L is a sequence of transactions. A path is called
advantageous if it leads to a more likely value of $X^c$, otherwise it is disadvantageous.

In a Comraf model with more than one combinatorial r.v., the most straightforward
version of an inference algorithm would be a variation of the Iterative Conditional
Modes (ICM) method [20]. ICM optimizes the objective (3.1) for each node of
an MRF iteratively (in a round-robin fashion), given its Markov blanket. A possible
drawback of this approach can be evidenced when the linear combination (3.1) is

^1 Branch and Bound has been used for (uni-modal) clustering by Koontz et al. [62]; however, it is
questionably applicable to multi-modal learning due to its high computational complexity.
Input:
    G - Comraf graph of nodes $\{X_1^c, \ldots, X_m^c\}$ and edges $E$
    $P(X_1^c, \ldots, X_m^c)$ - joint probability distribution of data, factorized over G
    $l$ - number of optimization iterations

Output:
    Most likely $x_{1,l}^c, \ldots, x_{m,l}^c$

Initialization:
    For $i = 1, \ldots, m$ do
        Select a point in $L_i$ to be an initial value of $X_i^c$
    Compute the initial joint $P(x_{1,0}^c, \ldots, x_{m,0}^c)$, factorized over G

Main loop:
    For $j = 1, \ldots, l$ do
        Select variable $X_{i'}^c$ for optimization
        Construct advantageous path $(x_{i',j-1}^c \to x_{i',j}^c)$ in $L_{i'}$
        For all $i \ne i'$ do $x_{i,j}^c = x_{i,j-1}^c$

Algorithm 1: A template of an MPE procedure in Comrafs.


taken over log-potential functions $f_i$, which are intrinsically different in their
magnitude and/or semantics (such that explicitly taking their linear combination might
not be beneficial). For these situations, we propose another version of an inference
algorithm, called clique-wise optimization (CWO), which is a variation of a local
optimization method in an MRF. Similarly to ICM, the CWO algorithm iterates over
nodes in the MRF. For each node, a clique that contains this node is chosen and the
objective (3.1) is optimized with respect to the chosen clique only, i.e. independently
of the rest of the model. Sutton and McCallum [102] apply a similar method (called
piecewise training) in a supervised setting. Bouvrie [23] proposes to approach the
multi-modal clustering problem by iteratively applying a bi-modal clustering
algorithm. To some extent, Bouvrie’s method can be considered a special case of
CWO.

A template pseudo-code for the MPE approximation in a Comraf is given in
Algorithm 1. For each combinatorial r.v. $X_i^c$ in the Comraf, we first select and fix
its initial value as a point in the event space $L_i$. We then round-robin over each $X_i^c$,
for which we search for an advantageous path in $L_i$. When this path is constructed,
we fix its destination point to be a new value of $X_i^c$ and move to another node. We
repeat this procedure $l$ times. To transform this template into an actual algorithm,
we need to make the following choices:

• How to select initial values for each combinatorial r.v. in the Comraf.

• How to determine an ordering for variables in the optimization procedure (and
an ordering of cliques in CWO). One obvious approach is a plain or weighted
round-robin, but more sophisticated choices can also be made.

• How to construct an advantageous path in L.

We  will  address  these  points  in  the  following  chapters  of  this  thesis. 
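The template of Algorithm 1 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the names `mpe_template`, `neighbors`, `objective`, `toy_objective` and `unit_steps` are hypothetical, the schedule is a plain round-robin, and the advantageous path is built by greedy hill climbing (one possible answer to the third choice above, not necessarily the one used later in the thesis).

```python
from typing import Callable, List, Sequence, TypeVar

V = TypeVar("V")  # one value of a combinatorial r.v. (e.g. one clustering)


def mpe_template(
    initial_values: List[V],
    schedule: Sequence[int],
    neighbors: Callable[[int, V], List[V]],
    objective: Callable[[List[V]], float],
    max_steps: int = 100,
) -> List[V]:
    """ICM-style local search: optimize one node at a time, others fixed."""
    values = list(initial_values)          # choice 1: initial values
    for i in schedule:                     # choice 2: ordering of variables
        for _ in range(max_steps):         # choice 3: build an advantageous path
            current = objective(values)
            best_value, best_score = values[i], current
            for candidate in neighbors(i, values[i]):  # one transaction away
                trial = values[:i] + [candidate] + values[i + 1:]
                score = objective(trial)
                if score > best_score:
                    best_value, best_score = candidate, score
            if best_score <= current:      # no advantageous transaction left
                break
            values[i] = best_value
    return values


# Toy stand-in: two integer-valued "nodes" coupled through the objective.
def toy_objective(values):
    a, b = values
    return -(a - 3) ** 2 - (b - a) ** 2


def unit_steps(i, value):
    return [value - 1, value + 1]


result = mpe_template([0, 0], [0, 1] * 5, unit_steps, toy_objective)
```

Because each node is optimized while the others are held fixed, the search never decreases the objective, but it may stop at a local optimum rather than the global one.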

3.4  Summary 

While technically being graphical models, Comrafs are very different from existing
graphical models: all Comraf models we propose in this thesis are small models with
‘rich’ nodes, while existing graphical models are usually large models with ‘simple’
nodes. No existing inference techniques are applicable to Comrafs (as they cannot
deal with nodes as complex as combinatorial r.v.’s), so we have developed a new
inference framework for Comrafs.

The major advantage of Comrafs over existing graphical models is that Comrafs
provide a more flexible modeling environment: existing models are able to model
data only in terms of the graph G, while their objective function and their inference
algorithm are generic rather than task-specific. Usually, this property is not
considered to be a drawback of the graphical model framework: once the graph G is
designed for a certain task, it is straightforward to apply an existing inference method
to this graph. However, existing inference methods are approximations to the NP-hard
inference problem, and thus make various assumptions that can potentially be
inappropriate for the particular task being solved. The main disadvantage of generic
inference methods is that they make the same assumptions for every task. A
practitioner can choose one of a handful of existing inference methods (such as mean-field,
variational approximation, belief propagation, Gibbs sampling etc. [58]) for her task,
some of which can work better for this task while some can work worse, but none is
specific to the task.

Comrafs, in contrast, have three degrees of freedom: designing the graph G, the
objective and the inference algorithm, all specific to the task at hand. As we will
show below, this flexibility leads to constructing models that demonstrate excellent
performance on various unsupervised and semi-supervised learning tasks.
CHAPTER 4


COMRAFS  FOR  MULTI-MODAL  CLUSTERING 


Multi-modal (hard) clustering is the problem of simultaneously constructing m
partitionings of m data modalities, e.g. of documents, their words, authors, titles etc.
When clustering modalities simultaneously, one can overcome the statistical sparseness
of the data representation, leading to a dense, smoothed joint distribution of
the modalities that would result in (hypothetically) more accurate clusterings than
the ones obtained when each modality is clustered separately. Based on our previous
work [10, 12], we will empirically justify this hypothesis. In this chapter, we propose
a Comraf model for multi-modal clustering (for motivation and discussion, see
Chapter 1). Let us first introduce the notation.

Let $s_1, s_2, \ldots, s_N$ be a dataset of $N$ i.i.d. samples drawn from some discrete
distribution. Let $X = \{x_1, x_2, \ldots, x_n\}$ be the set of $n$ unique values comprising the event
space from which samples $s_i$ are drawn. We now define a random variable $X$ such
that $P(X = x_i)$ is given by the empirical frequencies of samples with value $x_i$ in the
dataset (i.e., $X$ has a multinomial distribution estimated using maximum likelihood).

Define a hard clustering $x^c$ to be a partitioning of $X$. Let $X^c = \{x_1^c, x_2^c, \ldots, x_K^c\}$ be
the combinatorial set of all $K$ partitionings of $X$, where $K$ is exponential in the size
of $X$. We will refer to the subsets of the $j$-th partitioning $x_j^c$ as
$\{\tilde{x}_{j,1}, \tilde{x}_{j,2}, \ldots, \tilde{x}_{j,k_j}\}$.
That is, the first subscript in $\tilde{x}_{j,i}$ is the index of a particular partitioning, and the
second subscript is the index of a subset (a cluster) within that partitioning.

Define $\tilde{X}_j$ to be a random variable over the subsets (clusters) in a partitioning
$x_j^c$, with the probability of selecting a cluster defined as the probability of selecting
any one of its members: $P(\tilde{X}_j = \tilde{x}_{j,i}) = \sum_{x \in \tilde{x}_{j,i}} P(x)$. Finally, define $X^c$ to be
a combinatorial r.v. with the event space $X^c$. In this thesis, we shall use parallel
notation for different modalities of data, replacing the “x’s” in the above notation
with variables appropriate for the data source. For example, $w_j$ would represent a
specific word in a dataset, $\tilde{w}$ would be a cluster, $w^c$ would be a partitioning of words,
and so on.
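The notation above can be made concrete with a short Python sketch. It is illustrative only: the function names and the toy samples are assumptions, not part of the thesis. It estimates $P(X)$ by empirical frequencies and then obtains the cluster variable's distribution by summing the probabilities of each cluster's members.

```python
from collections import Counter


def empirical_distribution(samples):
    """Maximum-likelihood multinomial estimate: P(X = x) = count(x) / N."""
    counts = Counter(samples)
    total = len(samples)
    return {value: count / total for value, count in counts.items()}


def cluster_distribution(p, partitioning):
    """P(cluster) is the sum of P(x) over the members x of that cluster."""
    return [sum(p[x] for x in cluster) for cluster in partitioning]


# Toy dataset of N = 8 samples over n = 4 unique values.
samples = ["a", "b", "a", "c", "d", "a", "b", "d"]
p = empirical_distribution(samples)

# One hard clustering (a partitioning of the unique values) with two clusters.
partitioning = [{"a", "c"}, {"b", "d"}]
cluster_p = cluster_distribution(p, partitioning)
```

Since the clusters partition the unique values, the cluster probabilities again sum to one.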


4.1  Choosing  an  objective  function 

As discussed in Section 3.2, interactions between combinatorial r.v.’s are
represented by edges in a Comraf graph. To use the objective from Equation (3.1), we
should choose relevant cliques in the Comraf graph and define log-potential functions
over these cliques. To make the inference feasible, we consider only the smallest
cliques, i.e. adjacent pairs. Since our inference objective allows us to use complicated
log-potential functions (see, again, Section 3.2), we use the mutual information (MI)
between r.v.’s defined over values of adjacent nodes. Let $x_i^c$ and $x_j^c$ be such values
(particular partitionings of two modalities). A log-potential is then defined as:


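$$f(x_i^c, x_j^c) = I(\tilde{X}_i; \tilde{X}_j) = \sum_{\tilde{x}_{i,i'},\, \tilde{x}_{j,j'}} P(\tilde{x}_{i,i'}, \tilde{x}_{j,j'}) \log \frac{P(\tilde{x}_{i,i'}, \tilde{x}_{j,j'})}{P(\tilde{x}_{i,i'})\, P(\tilde{x}_{j,j'})}. \tag{4.1}$$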
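As a hedged illustration of the log-potential (4.1), the sketch below computes the mutual information between two cluster variables from a co-occurrence table and two hard cluster assignments. The function name, the list-of-lists table layout and the toy counts are assumptions made for this example; natural logarithms are used, so the result is in nats.

```python
import math


def clustered_mutual_information(table, row_cluster, col_cluster, k_rows, k_cols):
    """MI between row clusters and column clusters of a co-occurrence table."""
    # Aggregate raw counts into a (k_rows x k_cols) table over cluster pairs.
    joint = [[0.0] * k_cols for _ in range(k_rows)]
    total = 0.0
    for r, row in enumerate(table):
        for c, count in enumerate(row):
            joint[row_cluster[r]][col_cluster[c]] += count
            total += count
    # Marginals of the clustered joint distribution.
    p_rows = [sum(row) / total for row in joint]
    p_cols = [sum(joint[a][b] for a in range(k_rows)) / total for b in range(k_cols)]
    mi = 0.0
    for a in range(k_rows):
        for b in range(k_cols):
            p_ab = joint[a][b] / total
            if p_ab > 0.0:  # terms with zero mass contribute nothing
                mi += p_ab * math.log(p_ab / (p_rows[a] * p_cols[b]))
    return mi


# Toy table: 4 documents (rows) by 2 words (columns).
table = [[4, 0], [4, 0], [0, 4], [0, 4]]
aligned = clustered_mutual_information(table, [0, 0, 1, 1], [0, 1], 2, 2)
mixed = clustered_mutual_information(table, [0, 1, 0, 1], [0, 1], 2, 2)
```

In this toy table, the clustering that groups documents with identical word usage preserves all of the information, while the clustering that mixes them preserves none.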


Our motivation for choosing MI as a log-potential function is as follows: a linear
combination of MI terms has traditionally been used as a clustering criterion, both
in uni-modal clustering methods, such as Information Bottleneck (IB) [106], and in
bi-modal methods [35]. Slonim et al. [97] generalize the IB clustering criterion to a
multivariate case: in place of mutual information, they use Multi-Information^1


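$$I(\tilde{X}_1; \ldots; \tilde{X}_m) = \sum_{\tilde{x}_{1,i_1}, \ldots, \tilde{x}_{m,i_m}} P(\tilde{x}_{1,i_1}, \ldots, \tilde{x}_{m,i_m}) \log \frac{P(\tilde{x}_{1,i_1}, \ldots, \tilde{x}_{m,i_m})}{P(\tilde{x}_{1,i_1}) \cdots P(\tilde{x}_{m,i_m})}, \tag{4.2}$$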


^1 For alternative definitions and discussions on Multi-Information, see [114, 54].

which naturally factorizes over a directed graphical model. With little effort, we
can show that Multi-Information also factorizes over a tree-structured undirected
graphical model, reducing to a sum of pairwise MI terms defined over edges of the
tree. However, in the case of an arbitrary Comraf graph, Multi-Information cannot
be fully factorized. In general, objective functions based on high order statistics
(including Multi-Information) are problematic for loopy Comraf graphs. From a
statistical viewpoint, it is not clear whether we can extract reliable estimates for the
full joint distribution $P(\tilde{X}_1, \ldots, \tilde{X}_m)$. Still, we can approximate Multi-Information
by a sum of pairwise MI terms. Estimating the quality of such an approximation
remains an open question.
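The tree-structured case mentioned above can be spelled out in one step. The sketch below assumes that the joint distribution of the cluster variables is Markov with respect to a tree $T = (V, E)$, so that it can be written in terms of node and edge marginals. This is the standard tree factorization, stated here as an assumption rather than quoted from the thesis.

```latex
% Assumption: the joint is Markov with respect to a tree T = (V, E).
P(\tilde{x}_1, \ldots, \tilde{x}_m)
  = \prod_{i \in V} P(\tilde{x}_i)
    \prod_{(i,j) \in E} \frac{P(\tilde{x}_i, \tilde{x}_j)}{P(\tilde{x}_i)\, P(\tilde{x}_j)}
\quad\Longrightarrow\quad
I(\tilde{X}_1; \ldots; \tilde{X}_m)
  = \mathbb{E}\left[ \log \frac{P(\tilde{x}_1, \ldots, \tilde{x}_m)}{\prod_{i \in V} P(\tilde{x}_i)} \right]
  = \sum_{(i,j) \in E} \mathbb{E}\left[ \log \frac{P(\tilde{x}_i, \tilde{x}_j)}{P(\tilde{x}_i)\, P(\tilde{x}_j)} \right]
  = \sum_{(i,j) \in E} I(\tilde{X}_i; \tilde{X}_j).
```

For a graph with loops no such product of edge marginals reproduces the joint in general, which is why the sum of pairwise MI terms is only an approximation there.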

Thus, substituting log-potentials (4.1) into the MPE inference model (3.1), our
objective function for multi-modal clustering with Comrafs is then:

$$x^{c*} = \arg\max_{x^c} P(x^c) = \arg\max_{x^c} \sum_{(X_i^c, X_j^c) \in E} I(\tilde{X}_i; \tilde{X}_j). \tag{4.3}$$

This maximization is performed subject to constraints on the cardinalities $k_i = |\tilde{X}_i|$,
$i = 1, \ldots, m$ (i.e., the number of clusters is fixed). Without these constraints, the
maximization would lead to a degenerate case of all singleton clusters. Note that
these constraints do not necessarily imply the use of a flat clustering scheme (see
Chapter 2). In a particular clustering algorithm, clusters can be split or merged,
after which the number of clusters is fixed and the optimization of the objective
function (4.3) is performed.

We apply the ICM scheme (see Section 3.3) to multi-modal clustering: we iterate
over combinatorial r.v.’s in the Comraf graph, and at each iteration (over node $X_i^c$)
we construct the most likely clustering $x_i^{c*}$ by optimizing the objective function (4.3).
It is important to note that in the general case the objective (4.3) has $O(|X|^2)$ terms.
However, at each ICM iteration only one node is optimized, therefore the objective
Figure 4.1. Comraf graphs for: (a) hard version of Information Bottleneck; (b)
information-theoretic co-clustering; (c) one of the possible 4-modal Comrafs.


is reduced to:

$$x_i^{c*} = \arg\max_{x_i^c} \sum_{i' :\, (X_i^c, X_{i'}^c) \in E} I(\tilde{X}_i; \tilde{X}_{i'}) \tag{4.4}$$

that sums over only $O(|X|)$ neighbors of $X_i^c$ (i.e. its Markov blanket).

The  resulting  model  has  two  important  special  cases: 

• A hard version of Information Bottleneck [106]. In Information Bottleneck,
given two modalities $X$ and $Y$, a clustering $x^{c*}$ is constructed that
maximizes information about $Y$ (and minimizes information about $X$):

$$x^{c*} = \arg\max_{x_j^c} \left( I(\tilde{X}_j; Y) - \beta I(\tilde{X}_j; X) \right), \tag{4.5}$$

where $\beta$ is a Lagrange multiplier. The compression constraint $I(\tilde{X}_j; X)$ can
be omitted if the number of clusters is fixed: $|\tilde{X}_j| = k$. Consider graph G in
Figure 4.1(a), where a shaded $Y^c$ represents an observed variable.^2 Over the
only clique in G, we define one log-potential which is the mutual information
$I(\tilde{X}_j; Y)$. The MPE optimization objective for such a Comraf is then:

$$x^{c*} = \arg\max_{x_j^c} P(x_j^c, y^c) = \arg\max_{x_j^c} I(\tilde{X}_j; Y),$$

subject to $|\tilde{X}_j| = k$, which is clearly equivalent to the Information Bottleneck
objective (4.5).

^2 For discussion on observed variables see Chapters 5 and 7.

• Information-theoretic co-clustering [35] is a task of simultaneously clustering
two modalities $X$ and $Y$, while minimizing the information loss
$I(X; Y) - I(\tilde{X}_j; \tilde{Y}_j)$ under the constraints $|\tilde{X}_j| = k_1$ and $|\tilde{Y}_j| = k_2$.
Note that $I(X; Y)$ is a constant for a given dataset. This scheme is a special case
of a Comraf as well: given graph G in Figure 4.1(b), in analogy to the Comraf model of
Information Bottleneck, we define the only log-potential $I(\tilde{X}_j; \tilde{Y}_j)$. Then the
information-theoretic co-clustering can be represented as an MPE inference in
this Comraf:

$$(x^{c*}, y^{c*}) = \arg\max_{x_j^c,\, y_j^c} P(x_j^c, y_j^c) = \arg\max_{x_j^c,\, y_j^c} I(\tilde{X}_j; \tilde{Y}_j).$$

4.2  Clustering  as  combinatorial  optimization 

Given a variable $X$ with $n$ values clustered into $k$ clusters, the combinatorial
r.v. $X^c$ has $k^n$ values, all of which can be represented as points in an $n$-dimensional
lattice L: a point $x^c = (i_1, i_2, \ldots, i_n)$ corresponds to the fact that value $x_1$ of $X$
belongs to the $i_1$-th cluster, value $x_2$ belongs to the $i_2$-th cluster, ..., value $x_n$ belongs
to the $i_n$-th cluster.^3 In the lattice L there is a (possibly non-unique) point
$x^{c*} = (i_1^*, i_2^*, \ldots, i_n^*)$ which is most likely. Since the lattice consists of an exponential number
of points, the task of finding the most likely point can be computationally hard.
In lattice L, a transaction (see Definition 3.3.1) is interpreted as an operation of
transferring a value $x_j$ from cluster $\tilde{x}_i$ to cluster $\tilde{x}_{i'}$, i.e.
$(\ldots, i_j, \ldots) \to (\ldots, i'_j, \ldots)$, where $i_j \ne i'_j$.
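The lattice view can be illustrated with a small Python sketch. The function names are hypothetical, and exhaustive enumeration is shown only to make the $k^n$ count tangible; it is exactly what becomes infeasible for realistic $n$.

```python
from itertools import product


def lattice_points(n, k):
    """All k**n points of the lattice: one cluster index per value of X."""
    return list(product(range(k), repeat=n))


def transactions(point, k):
    """Neighbors of a lattice point: move exactly one value to another cluster."""
    result = []
    for j, current in enumerate(point):
        for target in range(k):
            if target != current:
                result.append(point[:j] + (target,) + point[j + 1:])
    return result


points = lattice_points(n=4, k=3)              # n = 4 values, k = 3 clusters
moves = transactions((0, 0, 1, 2), k=3)        # points reachable in one transaction
```

Each point has $n(k - 1)$ neighbors, so a local search inspects only a small neighborhood at every step even though the lattice itself grows exponentially in $n$.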


^3 Recall that we consider only hard clustering: $P(\tilde{x}_{i_j} | x_j) = 1$, that is, a value $x_j$ is assigned only
to the $i_j$-th cluster.




Note that we can view both splits and mergers of clusters as transactions. A split
of a cluster $i_{j'}$ is a transaction $(\ldots, i_{j'}, \ldots) \to (\ldots, i'_{j'}, \ldots)$, where
$\exists j \ne j' : i_{j'} = i_j$ and $\forall j \ne j' : i'_{j'} \ne i_j$. That is, cluster $i_{j'}$ contained
at least two elements ($x_j$ and $x_{j'}$), one of which ($x_{j'}$) has been transferred into a
newly created cluster $i'_{j'}$. A merger of clusters $i_{j'}$ and $i'_{j'}$ is a transaction
$(\ldots, i_{j'}, \ldots) \to (\ldots, i'_{j'}, \ldots)$, where $\exists j \ne j' : i'_{j'} = i_j$ and
$\forall j \ne j' : i_{j'} \ne i_j$, i.e. cluster $i_{j'}$ contained only one element
that has been added to the existing cluster $i'_{j'}$, so that the cluster $i_{j'}$ does not exist
anymore. These operations will help us to represent both agglomerative (bottom-up)
and divisive (top-down) clustering schema as inference in Comrafs.

By applying splits, mergers and other transactions, we construct a path in the
lattice L. Our goal is to make this path as advantageous as possible, such that a
clustering at the end of this path will be the most probable clustering that could be
found. Thus, we view the process of clustering a set $X$ as an MPE approximation of
a combinatorial r.v. $X^c$, where the MPE is approximated using a local search in the
lattice L. To perform the local search, we apply the simplest, greedy combinatorial
optimization method, hill climbing: at each ICM iteration, we attempt to construct
the most advantageous path in L, given the available computational resources.

Let  us  discuss  particular  algorithms  in  more  detail  in  the  next  section. 

4.3  Multi-way  Distributional  Clustering  (MDC) 

In this section we describe our scheme for clustering m modalities that aims at
maximizing our objective function (4.3). This scheme is called Multi-way Distributional
Clustering (MDC) [10]. Let G be a Comraf graph over combinatorial random
variables $X_i^c$, $i = 1, \ldots, m$. For each edge $e_{ii'}$ in graph G we are given a contingency
table $T_{ii'}$ that provides the corresponding co-occurrence counts of the modalities $X_i$
and $X_{i'}$. The input to the algorithm is the graph G, the tables $T_{ii'}$, as well as m
desired cardinalities $k_1, \ldots, k_m$ of the final partitionings, and a clustering schedule
(the sequence of variables for optimization in the ICM loop, see below for details).
The output of the algorithm is m partitionings $\tilde{X}_i$, $i = 1, \ldots, m$, each of cardinality
$k_i = |\tilde{X}_i|$.


The desired cardinalities are essential parameters of MDC, as our
method cannot infer them. We believe that the problem of inducing the optimal
number of clusters is generally ill-defined: imagine a dataset that is situated on a
plane in a triangle, each corner of which consists of three triangles of data instances
(27 instances overall). It is hard to decide what the best number of clusters would be
in this case: three or nine. Admittedly, not all machine learning researchers would
agree with this argument. Some existing clustering methods attempt to solve the
problem of optimal number of clusters (such as, e.g., the Chinese Restaurant
Process [30]). Still and Bialek [101] come up with the optimal number of clusters in an
Information Bottleneck setting. While their method is well justified theoretically, it
could not induce a meaningful number of clusters in our experiments.

To compute the objective function (4.3) we will need the following definitions
and identities, where for the current discussion we re-notate $X = X_i$ and $Y = X_j$:

$$p(\tilde{x}, \tilde{y}) = \sum_{x \in \tilde{x},\; y \in \tilde{y}} p(x, y), \tag{4.6}$$

where $p(x) = \sum_{y} p(x, y)$, and $p(y) = \sum_{x} p(x, y)$.

Pseudo-code  for  the  multi-way  distributional  clustering  (MDC)  algorithm  is  given 
in  Algorithm  2.  For  simplicity,  the  pseudo-code  abstracts  away  several  details  that 
Input:
    G - Comraf graph of nodes $\{X_1^c, \ldots, X_m^c\}$ and edges $E$
    $T_{ii'}$ - contingency tables for each $e_{ii'} \in E$
    $S_{up}$, $S_{down}$ - bottom-up/top-down partition of $\{1, \ldots, m\}$
    $S_l = i_1, i_2, \ldots, i_l$ - clustering schedule, where each $i_j \in \{1, \ldots, m\}$

Output:
    Most likely clusterings $x_{1,l}^c, \ldots, x_{m,l}^c$

Initialization:
    For each $i = 1, \ldots, m$ do
        If $i \in S_{down}$ then
            Place all values of $X_i$ in one cluster
        Else If $i \in S_{up}$ then
            Place each value of $X_i$ in a singleton cluster

Main loop:
    For each $i_j$ from $S_l$ do
        Split/merge phase:
            If $i_j \in S_{down}$ then
                Split each cluster in $x_{i_j}^c$ uniformly at random to two halves
            Else If $i_j \in S_{up}$ then
                Merge each cluster in $x_{i_j}^c$ with its closest peer
        Optimization phase:
            For each value $x$ of $X_{i_j}$ do
                Pull $x$ out of its current cluster
                Place $x$ into a cluster, s.t. objective function (4.4) is maximized

Algorithm 2: Multi-Way Distributional Clustering (MDC).


are not essential for understanding the general idea but can be important for actual
applications. We now discuss the algorithm and then provide the necessary details.^4

The  main  loop  of  the  algorithm  is  controlled  by  two  parameters: 

• Partition ($S_{up}$, $S_{down}$) of the set of variable indices. If $i \in S_{up}$, then the variable
$X_i$ is clustered using a bottom-up procedure. Otherwise (i.e. $i \in S_{down}$), $X_i$ is
clustered via the top-down procedure.

• Clustering schedule $S_l = i_1, \ldots, i_l$, which is a sequence of variable indices. The
schedule $S_l$ determines the order of processing the variables. While this mechanism
allows for great flexibility, we always apply it in a straightforward manner
where the sequence $S_l$ specifies a (weighted) round-robin schedule. For example,
in the case of bi-modal clustering (with two variables $X_1$ and $X_2$), we


^4 An efficient C++ implementation of MDC that was used in our experimental study can be
downloaded from http://sourceforge.net/projects/comraf.
Figure 4.2. A schematic view of bi-modal MDC with a simple, non-weighted round-robin
schedule. At each iteration black clusters are split and then white clusters are
merged.

take (ignoring, for the moment, the desired cluster cardinalities) $S_{down} = \{1\}$,
$S_{up} = \{2\}$ and $S_l = 1, 2, 1, 2, \ldots, 1, 2$. A schematic view of MDC (for this
bi-modal instance) is given in Figure 4.2.

We propose two versions of the optimization phase of our algorithm: sequential
and shuffled:

• In the sequential version, we iterate over all values $x$ of $X_{i_j}$ in a random order
(determined via a permutation selected uniformly at random). We assign $x$
into its “best” cluster, i.e. such cluster that the objective from Equation (4.4)
is maximized. Note that this optimization routine is similar to and inspired by
the sequential Information Bottleneck (sIB) clustering algorithm [96]. We then
iterate over all the values of $X_{i_j}$ once again, in order to further optimize the
objective, i.e. two optimization passes are performed overall.

• In the shuffled version, we repeat the following step a predefined number of
times:^5 we uniformly at random select a data point $x$ and a cluster $\tilde{x}$, and
assign $x$ into $\tilde{x}$ if this transaction increases the value of our objective. The
shuffled approach opens the door to improving scalability of MDC (the number
of iterations is constant and can be chosen arbitrarily small, at the cost of
decreasing performance) and to parallelization.

^5 We set it equal (for fair comparison) to the number of iterations in the sequential version.
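The two variants of the optimization phase can be sketched as follows for the bi-modal case, where the objective (4.4) reduces to a single MI term. This is an illustrative sketch, not the thesis implementation: all names are hypothetical, the MI is recomputed from scratch for clarity rather than updated incrementally, and the guard that refuses to empty a cluster is an added assumption that keeps the number of clusters fixed.

```python
import math
import random


def mutual_information(table, rows, cols, k_rows, k_cols):
    """MI (in nats) between row clusters and column clusters of a count table."""
    joint = [[0.0] * k_cols for _ in range(k_rows)]
    for r, row in enumerate(table):
        for c, count in enumerate(row):
            joint[rows[r]][cols[c]] += count
    total = sum(map(sum, joint))
    p_r = [sum(row) / total for row in joint]
    p_c = [sum(col) / total for col in zip(*joint)]
    return sum(
        (v / total) * math.log((v / total) / (p_r[a] * p_c[b]))
        for a, row in enumerate(joint) for b, v in enumerate(row) if v > 0
    )


def sequential_pass(table, rows, cols, k_rows, k_cols, rng):
    """Visit every row value once in random order; move it to its best cluster."""
    order = list(range(len(rows)))
    rng.shuffle(order)
    for r in order:
        original = rows[r]
        if rows.count(original) == 1:   # assumption: never empty a cluster
            continue
        best = original
        best_score = mutual_information(table, rows, cols, k_rows, k_cols)
        for target in range(k_rows):
            rows[r] = target
            score = mutual_information(table, rows, cols, k_rows, k_cols)
            if score > best_score:
                best, best_score = target, score
        rows[r] = best


def shuffled_pass(table, rows, cols, k_rows, k_cols, rng, steps):
    """Repeat: pick a random value and a random cluster; keep the move if it helps."""
    for _ in range(steps):
        r, target = rng.randrange(len(rows)), rng.randrange(k_rows)
        if rows.count(rows[r]) == 1:    # assumption: never empty a cluster
            continue
        before = mutual_information(table, rows, cols, k_rows, k_cols)
        old, rows[r] = rows[r], target
        if mutual_information(table, rows, cols, k_rows, k_cols) <= before:
            rows[r] = old               # reject a non-improving transaction


# Toy usage: 4 documents by 4 words, word clusters fixed, documents optimized.
rng = random.Random(0)
table = [[5, 1, 0, 0], [4, 2, 0, 0], [0, 0, 3, 3], [0, 1, 4, 2]]
cols, rows = [0, 0, 1, 1], [0, 1, 0, 1]
for _ in range(2):                      # two sequential passes, as in the text
    sequential_pass(table, rows, cols, 2, 2, rng)
```

Both passes accept only moves that strictly increase the objective, so neither can decrease it; the shuffled pass trades thoroughness for a freely chosen number of steps.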

Note that both sequential and shuffled procedures can never decrease the objective
function. However, cluster mergers usually decrease it, so the optimization is
non-convex in the general case.

The choice of index partition ($S_{up}$, $S_{down}$) is based on the following two crucial
observations. First, for practical applications it is computationally infeasible
to apply bottom-up procedures for all the variables. Second, applying only top-down
procedures is likely to be useless, in terms of the clustering quality. This
is easy to see when considering bi-modal applications, with respect to two variables
$X$ and $Y$. The objective function reduces to $I(\tilde{X}; \tilde{Y})$ and we start with
$x^c$ and $y^c$ each being a single cluster containing all points. Clearly, in this case
$I(\tilde{X}; \tilde{Y}) = 0$. We now split $\tilde{X}$ to get $\tilde{X} = \{\tilde{x}_1, \tilde{x}_2\}$. For any $(\tilde{x}_1, \tilde{x}_2)$-partition
we have $H(\tilde{Y}|\tilde{X}) = -\sum_i p(\tilde{x}_i, \tilde{y}) \log p(\tilde{y}|\tilde{x}_i) = 0$, since $p(\tilde{y}|\tilde{x}_i) = 1$. Therefore,
$I(\tilde{X}; \tilde{Y}) = H(\tilde{Y}) - H(\tilde{Y}|\tilde{X}) = H(\tilde{Y}) = 0$, and the corrective step of the algorithm is
useless here. The subsequent split of $\tilde{Y}$ strictly optimizes the objective function, but
the resulting clustering is optimized to correlate with the initial random split of the
$X$ variable. This way, all the subsequent partitions are optimized with respect to a
meaningless, random partition. A similar argument applies to the general MDC and
implies that at least one of the clustering procedures must not be computed top-down.
A natural choice for clustering this variable would be a bottom-up method because
its initialization phase (singleton clusters) does not require any prior knowledge to be
incorporated (for a discussion, see Chapter 2).

As mentioned above, in all our applications we construct (weighted) round-robin
schedules $S_l = i_1, \ldots, i_l$. In order to accommodate the required cardinalities $k_1, \ldots, k_m$
of clusterings $x_1^c, \ldots, x_m^c$, the MDC algorithm performs the following number of
iterations: $l_i = \lceil \log k_i \rceil$ for $i \in S_{down}$, and $l_i = \lceil \log(|X_i| / k_i) \rceil$ for $i \in S_{up}$. Thus, each
index $i$ appears $l_i$ times in the sequence $S_l$, in a (weighted) round-robin fashion.
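A small Python sketch of how such a schedule might be assembled is given below. It assumes base-2 logarithms (each split doubles and each merge halves the number of clusters), and the even-spacing rule used to interleave the indices is only one possible reading of "weighted round-robin". The function names and the example sizes are hypothetical.

```python
import math


def iteration_counts(sizes, targets, s_up):
    """Iterations per variable: ceil(log2 k_i) top-down, ceil(log2(|X_i|/k_i)) bottom-up."""
    counts = {}
    for i, n in sizes.items():
        if i in s_up:
            counts[i] = math.ceil(math.log2(n / targets[i]))
        else:
            counts[i] = math.ceil(math.log2(targets[i]))
    return counts


def weighted_round_robin(counts):
    """Interleave indices so that index i appears counts[i] times, evenly spread."""
    slots = []
    for i, c in counts.items():
        for t in range(c):
            slots.append(((t + 0.5) / c, i))   # position of the t-th visit in [0, 1]
    return [i for _, i in sorted(slots)]


# Bi-modal example: variable 1 (top-down) ends with 8 clusters,
# variable 2 (bottom-up) shrinks from 8192 singletons to 64 clusters.
sizes = {1: 1000, 2: 8192}
targets = {1: 8, 2: 64}
counts = iteration_counts(sizes, targets, s_up={2})
schedule = weighted_round_robin(counts)
```

When the iteration counts differ, the variable with more iterations is simply visited more often, which is what makes the round-robin weighted.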

4.3.1  Computational  complexity  of  MDC 

We now analyze the time complexity of the sequential version of MDC^6 for a
non-weighted round-robin schedule. The complexity issue should be taken into account
when forming the partition ($S_{up}$, $S_{down}$), because the time complexity of the algorithm
depends on $u = |S_{up}|$, i.e. on the number of modalities clustered bottom-up. Let
$|X| = \max(|X_1|, \ldots, |X_m|)$, the size of the largest support of variables $X_1, \ldots, X_m$.
At each iteration, (sequential) MDC performs three nested loops:

1. Pass over each value of $X_i$: $O(|X|)$ times;

2. For each value of $X_i$, pass over each cluster in $\tilde{X}_i$: $|\tilde{X}_i| = O(|X|)$ times;

3. For each cluster in $\tilde{X}_i$, pass over clusters in all the other clusterings (excluding
$\tilde{X}_i$ itself): $O(m|X|) = O(|X|)$ times (the number of clustered variables, $m$, is a
constant in our case).

Since the number of iterations is $n = O(\log |X|)$, in the worst case (when $u > 1$)
the time complexity is $O(n|X|^3) = O(|X|^3 \log |X|)$. This complexity can be burdensome
in some real-world applications. Note, however, that for each variable $X_i$ which
is clustered top-down, at each iteration $j$ the number of clusters is $|\tilde{X}_{i,j}| = O(k_i) =
O(1)$. Thus, when $u = 1$, either loop 2 or loop 3 is performed $O(1)$ times, and the
overall running time is $O(|X|^2 \log |X|)$, which is affordable for many applications.

In the bi-modal case, at each iteration the size of one clustering is doubled, and
at the next iteration the size of the other clustering is halved. Therefore, at each
iteration $|\tilde{X}_1| \cdot |\tilde{X}_2| \le 2|X|$, i.e. the constant under the ‘big-O’ is only 2. The
(non-hierarchical) co-clustering algorithm of [35] has the same complexity $O(|X|^2 \log |X|)$,
but with a larger constant under the ‘big-O’.

^6 The complexity of our implementation of the shuffled version is the same as the one of the
sequential version, because we choose to fix the number of iterations in the shuffled version equal to
the number of iterations in the sequential version.

Based on this analysis, in all our experiments we fix $u = 1$, i.e., only one variable is
clustered bottom-up. Finally, note that if variable $X_i$ has a small support, $|X_i| \ll |X|$,
then the decision whether $i \in S_{up}$ or $i \in S_{down}$ can be made independently of time
complexity considerations.

4.4  Clique-wise  MDC 

As  we  discussed  in  Section  3.3,  global  optimization  of  the  objective  function  (3.1) 
is  not  always  beneficial.  As  an  alternative,  we  proposed  a  clique-wise  optimization 
(CWO)  procedure.  In  this  section,  we  propose  a  clique-wise  version  of  the  MDC 
algorithm,  which  is  inspired  by  Bouvrie’s  algorithm  [23].  Its  pseudocode  is  given 
in  Algorithm  3.  To  keep  the  procedure  as  simple  as  possible,  we  consider  only  the 
smallest  cliques,  i.e.  edges  in  the  Comraf  graph  G.  In  contrast  to  the  original  MDC 
that iterates over nodes in G, the CWO version iterates over edges, in a round-robin
fashion. For every edge $e_{ii'}$, the algorithm performs the MPE optimization of a
portion of G that consists of only one edge $e_{ii'}$ and its vertices $X^c_i$ and $X^c_{i'}$. This
optimization is performed independently of the rest of the model. The best values of
$X^c_i$ and $X^c_{i'}$ found during this optimization step are then used as initial values for the
next  optimization  steps. 

In this setup, an application of hierarchical clustering appears unnatural: after
the j-th optimization iteration over one clique, the constructed clusterings $x^c_{i,j}$ and
$x^c_{i',j}$ are supposed to have the desired number of clusters ($k_i$ and $k_{i'}$ respectively).
Using these clusterings as initial values of further optimization steps leaves no room
for exploring the clustering hierarchy. For this reason, and also for simplicity, at each
optimization step we apply a flat clustering method, similar to the sequential
Information Bottleneck [96]. Quite surprisingly, the results of this flat clustering procedure
are comparable to those of the original (more complex) MDC (see Section 4.6.5).

Input:
    G - Comraf graph of nodes $\{X^c_1, \ldots, X^c_m\}$ and edges E
    $T_{ii'}$ - contingency tables for each $e_{ii'} \in E$
    $k_1, \ldots, k_m$ - the desired number of clusters for each node
    $S_l = \{(i_1, i'_1), \ldots, (i_l, i'_l)\}$ - clustering schedule, where each pair $(i, i')$ corresponds to edge $e_{ii'}$
Output:
    Most likely clusterings $x^c_1, \ldots, x^c_m$
Initialization:
    For each $i = 1, \ldots, m$ do
        Distribute all values of $X_i$ uniformly at random over $k_i$ clusters
Main loop:
    For each $(i_j, i'_j)$ from $S_l$ do
        For each value x of $X_{i_j}$ do
            Pull x out of its current cluster
            Place x into a cluster, s.t. $I(X^c_{i_j}; X^c_{i'_j})$ is maximized
        For each value x of $X_{i'_j}$ do
            Pull x out of its current cluster
            Place x into a cluster, s.t. $I(X^c_{i_j}; X^c_{i'_j})$ is maximized

Algorithm 3: Clique-wise MDC.
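
The main loop of Algorithm 3 can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the implementation used in our experiments: contingency tables are assumed to be dense lists of co-occurrence counts, the names `mutual_information`, `reassign_pass` and `clique_wise_mdc` are hypothetical, and the mutual information is recomputed from scratch for every candidate placement (a practical implementation would update it incrementally).

```python
import math
import random
from typing import Dict, List, Sequence, Tuple

Table = List[List[float]]


def mutual_information(table: Table, rows: Sequence[int], cols: Sequence[int],
                       k_rows: int, k_cols: int) -> float:
    """I(row clustering; column clustering) in nats for a co-occurrence table."""
    joint = [[0.0] * k_cols for _ in range(k_rows)]
    for r, row in enumerate(table):
        for c, count in enumerate(row):
            joint[rows[r]][cols[c]] += count
    total = sum(map(sum, joint))  # assumed to be positive
    p_row = [sum(line) / total for line in joint]
    p_col = [sum(joint[a][b] for a in range(k_rows)) / total for b in range(k_cols)]
    mi = 0.0
    for a in range(k_rows):
        for b in range(k_cols):
            p = joint[a][b] / total
            if p > 0.0:
                mi += p * math.log(p / (p_row[a] * p_col[b]))
    return mi


def reassign_pass(table: Table, moving: List[int], fixed: Sequence[int],
                  k_moving: int, k_fixed: int, moving_is_rows: bool) -> None:
    """Pull every value out of its cluster and place it where the MI is largest."""
    def score() -> float:
        if moving_is_rows:
            return mutual_information(table, moving, fixed, k_moving, k_fixed)
        return mutual_information(table, fixed, moving, k_fixed, k_moving)

    for v in range(len(moving)):
        best_cluster, best_score = moving[v], -math.inf
        for cluster in range(k_moving):
            moving[v] = cluster  # tentatively place the value in this cluster
            current = score()
            if current > best_score:
                best_cluster, best_score = cluster, current
        moving[v] = best_cluster


def clique_wise_mdc(tables: Dict[Tuple[int, int], Table], sizes: Sequence[int],
                    ks: Sequence[int], schedule: Sequence[Tuple[int, int]],
                    seed: int = 0) -> List[List[int]]:
    """Flat clique-wise clustering over the edges listed in `schedule`."""
    rng = random.Random(seed)
    # Initialization: spread the values of each variable at random over k_i clusters.
    clusterings = [[rng.randrange(k) for _ in range(n)] for n, k in zip(sizes, ks)]
    for i, i2 in schedule:
        table = tables[(i, i2)]  # rows: values of X_i, columns: values of X_i'
        reassign_pass(table, clusterings[i], clusterings[i2], ks[i], ks[i2], True)
        reassign_pass(table, clusterings[i2], clusterings[i], ks[i2], ks[i], False)
    return clusterings
```

Because the current cluster is always among the candidates, a pass can never decrease the mutual information of the edge being optimized.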

The computational complexity of each sequential optimization step is $O(k^2 |X|)$,
where $|X|$ is the size of the largest support among the variables in X, and k is the
largest final number of clusters. The number of iterations is $O(m^2)$, as the number of
edges in graph G is in the worst case quadratic in the number of combinatorial random
variables. The resulting complexity is then $O(m^2 k^2 |X|)$, which is asymptotically
linear in the size of the data. However, the constants can be very large. Still, in
practical cases, Comraf models are so compact that the $m^2$ factor is not
restrictive, and the clique-wise MDC is substantially faster than its original
ICM-based version.


4.5  Related  work 

The study of distributional clustering based on co-occurrence data using
information-theoretic objective functions was initiated by Pereira et al. [88]. Much of the
subsequent related work is inspired by that paper and the Information Bottleneck
(IB) ideas of Tishby et al. [106]. In this context, the first work considering two-way
clustering of both words and documents is by Slonim and Tishby [99], which was
subsequently improved by El-Yaniv and Souroujon [39], and then more thoroughly studied
by Dhillon et al. [35].

The more general Multivariate Information Bottleneck (mIB) framework [97] also
considers simultaneous clusterings based on interaction between variables, as we
propose here. For two variables (bi-modal clustering) the algorithm proposed here can
be viewed as a particular implementation of the “hard case” of mIB. However, for
more than two variables, the framework we propose here is not a special case of the
mIB framework since the interactions between variables in mIB are described via
a directed Bayesian network, in which cycles cannot be factorized to pairwise
dependencies (see Section 4.1). Our scheme employs undirected graphs that represent
pairwise interactions, and therefore does not preclude loops. It is important to note
that  our  clustering  algorithm  (MDC)  is  inspired  by  the  sequential  IB  method  [96]. 
Finally,  we  note  that  the  idea  of  multi-modal  clustering  also  appears  in  Bouvrie  [23], 
where  multiple  clusterings  are  constructed  by  an  iterative  application  of  a  bi-modal 
clustering  algorithm,  and  the  resulting  system  is  applied  to  computer  vision  tasks. 

4.6  Experimentation:  email  clustering 

In  this  section,  we  present  our  experimental  results  on  the  document  clustering 
task.  Two  particular  tasks  we  consider  are  similar  to  each  other:  (a)  automatic 
categorization  of  email  into  folders;  (b)  automatic  routing  of  newsgroup  messages 
into  appropriate  newsgroups. 

Email  foldering  is  a  rich  and  multi-faceted  problem,  with  many  difficulties  that 
make  it  different  from  traditional  topic-based  categorization.  Email  users  create  new 
folders, and let other folders fall out of use. Email folders do not necessarily
correspond to simple semantic topics; sometimes they correspond to unfinished todo tasks,
project groups, certain recipients, or loose agglomerations of topics. It is important
to note that email content and foldering habits differ drastically from one email user
to another, so while automated methods may perform well for one user, they may
fail horribly for another. In this thesis, however, we test Comraf performance
on email clustering under the simplifying assumption that folders roughly correspond
to semantic topics. In our future work, we will adapt our clustering system to specific
needs  of  particular  users. 

Despite  the  fact  that  clustering  is  rarely  used  as  a  stand-alone  application — it 
is  usually  a  part  of  another,  more  global  task — we  choose  to  focus  on  evaluating 
the  quality  of  the  clustering  results  per  se,  i.e.  not  with  respect  to  the  global  task. 
This  way,  our  evaluation  is  not  skewed  by  various  aspects  of  a  particular  real-world 
problem. 


4.6.1  Evaluation  measure 

Following  [96,  35]  and  many  other  works,  we  use  micro-averaged  accuracy  for 
evaluation of our clustering methods. Let $x^c$ be a clustering of the data X. Let T be
the set of ground truth categories. We fix the number of clusters to match the number
of categories: $|x^c| = |T| = k$. For each cluster $x_j$, let $\gamma_T(x_j)$ be the maximal number
of $x_j$'s elements that belong to one category. Then, the accuracy $Acc(x_j, T)$ of a cluster
$x_j$ with respect to T is defined as $Acc(x_j, T) = \gamma_T(x_j) / |x_j|$. The micro-averaged
accuracy of the clustering $x^c$ is:


$$Acc_m(x^c, T) = \frac{\sum_{j=1}^{k} \gamma_T(x_j)}{\sum_{j=1}^{k} |x_j|} = \frac{\sum_{j=1}^{k} \gamma_T(x_j)}{|X|} \tag{4.7}$$


In Section 4.6.5 we also present macro-averaged accuracy results, where the
macro-averaged accuracy is defined as:


$$Acc_M(x^c, T) = \frac{\sum_{j=1}^{k} Acc(x_j, T)}{k} \tag{4.8}$$
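
As an illustration of Equations (4.7) and (4.8), the two measures can be computed as in the following Python sketch. The function name `clustering_accuracy` and the representation of a clustering as two parallel label sequences are assumptions made for this example. The macro average here divides by the number of non-empty clusters, which equals k whenever no cluster is empty.

```python
from collections import Counter
from typing import Dict, Hashable, Sequence, Tuple


def clustering_accuracy(clusters: Sequence[Hashable],
                        categories: Sequence[Hashable]) -> Tuple[float, float]:
    """Return (micro-averaged, macro-averaged) accuracy of a clustering.

    clusters[n] is the cluster of data point n, categories[n] its true category.
    """
    members: Dict[Hashable, Counter] = {}
    for cluster, category in zip(clusters, categories):
        members.setdefault(cluster, Counter())[category] += 1
    # Size of the dominant category in each cluster, and each cluster's size.
    dominant = [max(counts.values()) for counts in members.values()]
    sizes = [sum(counts.values()) for counts in members.values()]
    micro = sum(dominant) / sum(sizes)
    macro = sum(d / s for d, s in zip(dominant, sizes)) / len(sizes)
    return micro, macro
```

For example, with one cluster of four points (three from the same category) and one cluster of two points (from two different categories), the micro-averaged accuracy is (3 + 1) / 6 and the macro-averaged accuracy is (3/4 + 1/2) / 2.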
Figure 4.3. Comraf graphs for 2-modal, 3-modal and 4-modal Comrafs used in our
experiments. We consider interactions between combinatorial random variables that
correspond to documents $D^c$, words $W^c$, email correspondents $C^c$ and email Subject
lines $S^c$. Note that we use only tree-structured models, as they are simpler than
loopy models and on the email foldering task they show comparable results to those
obtained with loopy models (see Section 4.6.6 for a discussion). In Section 4.8 we
present a result where a loopy model is significantly superior to a tree-structured one.


4.6.2  Datasets 

We  evaluate  the  Comraf  models  on  six  text  datasets.  In  addition  to  the  standard 
benchmark  20  Newsgroups  dataset  (20NG)  we  use  five  real-world  email  directories. 
Three of them belong to participants in the CALO project,7 and the other two belong
to former Enron employees.8

On the 20NG dataset we apply a bi-modal Comraf where the modalities are
messages (documents) and words. CALO and Enron datasets are particularly useful for
evaluating  3-modal  and  4-modal  Comrafs.  Here  we  take  as  variables  (1)  messages; 
(2)  words;  (3)  people  names  associated  with  messages — we  consider  the  entire  list  of 
correspondents  (both  senders  and  recipients);  and  (4)  email  Subject  lines,  represented 
by  their  bags  of  words.  Comraf  graphs  for  the  three  setups  are  given  in  Figure  4.3. 

Table  4.1  provides  basic  statistics  of  the  six  datasets.  For  details  on  collecting  the 
CALO  and  Enron  data,  see  [14].  Below  we  briefly  describe  the  data  and  preprocessing 
steps  undertaken. 

7http://www.ai.sri.com/project/CALO

8The Web page of the original Enron Email Dataset is http://www.cs.cmu.edu/~enron. Our
preprocessed Enron email directories can be obtained from
http://www.cs.umass.edu/~ronb/enron_dataset.html.
Dataset            Size     Min/max      Number of        Number of        Number of
                            class size   distinct words   correspondents   classes
CALO:acheyer         664    3/72          2863               67            38
CALO:mgervasio       777    6/116         3207               61            15
CALO:mgondek         297    3/94          1287               50            14
Enron:kitchen-l     4015    5/715        15579             2278            47
Enron:sanders-r     1188    4/420         5966              933            30
20NG               19997    997/1000     39764              N/A            20

Table 4.1. Statistics of email datasets. Number of distinct words and number
of correspondents are after preprocessing.


4.6.2.1 20 Newsgroups

The  20  Newsgroups  (20NG)  corpus  contains  19997  messages  taken  from  the  Usenet 
newsgroups collection.9 Each message is assigned to one or more semantic categories
and the total number of categories is 20, all of which are of about the same size. Most
of the documents have only one semantic label; however, about 4.5%
of documents have two or more labels. Those documents are simply duplicated in the
dataset  (one  copy  per  category).  In  this  thesis,  for  easier  replicability  of  our  results, 
we  decided  to  refrain  from  taking  steps  of  any  kind  to  resolve  the  duplication  issue. 

We preprocess the 20NG dataset as described in [11]. First, we remove message
headers and markup (such that only the subject lines and actual text remain).
Next, we filter out lines that seem to be part of binary files sent as attachments
or pseudo-graphical text delimiters. A line is considered to be a “binary” (or a
delimiter) if it is longer than 50 symbols and contains no white spaces. Overall, we
remove 23057 such lines (most of them appear in a handful of articles). Finally,
we represent documents as their Bags-Of-Words, lower the case of letters and remove
stopwords as well as low-frequency words.
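
The line filter and the Bag-Of-Words construction described above can be sketched in Python as follows. Only the 50-symbol, no-white-space rule comes from the text; the tokenizer (runs of Latin letters), the stopword set, the frequency threshold `min_count` and its interpretation as a total corpus count, and all function names are assumptions made for illustration.

```python
import re
from collections import Counter
from typing import List, Set


def is_binary_line(line: str, max_len: int = 50) -> bool:
    """True if the line is longer than max_len symbols and has no white space."""
    return len(line) > max_len and not any(ch.isspace() for ch in line)


def bag_of_words(text: str, stopwords: Set[str]) -> Counter:
    """Drop 'binary' lines, lowercase the rest, and count non-stopword tokens."""
    kept = [line for line in text.splitlines() if not is_binary_line(line)]
    tokens = re.findall(r"[a-z]+", " ".join(kept).lower())
    return Counter(token for token in tokens if token not in stopwords)


def drop_rare_words(bags: List[Counter], min_count: int) -> List[Counter]:
    """Remove words whose total count over the collection is below min_count."""
    totals: Counter = Counter()
    for bag in bags:
        totals.update(bag)
    return [Counter({w: c for w, c in bag.items() if totals[w] >= min_count})
            for bag in bags]
```

A typical use would build one bag per message with `bag_of_words` and then apply `drop_rare_words` once to the whole collection.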

9http://kdd.ics.uci.edu/databases/20newsgroups/20newsgroups.html
4.6.2.2 Enron Email Dataset


The  archived  email  from  many  of  the  senior  management  of  Enron  Corporation 
was  subpoenaed,  and  is  now  in  the  public  record.  The  data  consists  of  over  500,000 
email  messages  from  the  email  accounts  of  150  people.  The  dataset  is  provided  by 
SRI  after  major  clean-up  and  removal  of  attachments.  The  dataset  version  we  use 
was  released  on  February  3,  2004. 

Although the size of the dataset is large, many users’ folders are sparsely
populated. We use the email directories of two former Enron employees: KITCHEN-L and
SANDERS-R.  Those  directories  are  among  the  largest  ones  in  the  dataset. 

We remove standard non-topical folders “all_documents”, “calendar”, “contacts”,
“deleted_items”, “discussion_threads”, “inbox”, “notes_inbox”, “sent”, “sent_items”
and “_sent_mail”. We then flatten all the folder hierarchies and remove all the
folders that contain fewer than three messages. We also remove the X-folder field in
the message headers that actually contains the class label. As for 20NG, we finally
represent documents as their Bags-Of-Words, lower the case of letters and remove
stopwords and low-frequency words.

4.6.2.3 CALO Email Dataset

A  smaller  but  also  significant  corpus  of  real-world,  foldered  email  has  been  created 
as  part  of  the  CALO  DARPA/SRI  research  project.  This  corpus  consists  of  snapshots 
of  the  email  folders  of  196  users,  containing  approximately  22,000  messages.  From 
the February 2, 2004 snapshot of CALO directories, we select three users with a large
number of messages: ACHEYER, MGERVASIO, and MGONDEK. As in the preprocessing
step of the Enron datasets, we first remove standard non-topical folders (“Inbox”,
“Drafts”, “Sent” and “Trash”). Then the folder hierarchy is flattened, and folders
that contain fewer than three messages are removed. Finally, as for all the other
datasets, we represent documents as BOW, lowercase the text and filter out stopwords
and low-frequency words.

4.6.3  Baseline  algorithms 

We  compare  the  performance  of  Comraf  clustering  algorithms  with  the  following 
five  well  known  benchmark  clustering  algorithms: 

1.  K-means.  We  use  the  SimpleKMeans  implementation  of  WEKA  [112]; 

2. Agglomerative Information Bottleneck (aIB). A simple, deterministic
uni-modal Information Bottleneck clustering algorithm [98];

3. Sequential Information Bottleneck (sIB). A randomized uni-modal
Information Bottleneck clustering algorithm [96], which exhibited striking
performance in the text domain;

4. Information-theoretic co-clustering (ITCC). A bi-modal clustering
algorithm [35] (see Section 4.1);

5. Latent Dirichlet Allocation (LDA). A popular generative model for
representing document collections, proposed by Blei et al. [22]. Each document is
represented as a distribution of topics, and parameters of those distributions are
learned from the data. Documents are then clustered based on their posterior
distributions (given the topics). We use Xuerui Wang’s LDA implementation
[78] that applies Gibbs sampling with 10000 sampling iterations.10

Note  that  the  latter  three  algorithms  are  widely  considered  to  be  state-of-the-art  in 
unsupervised  text  categorization. 


10We  also  tried  David  Blei’s  LDA-C  [22]  that  implements  variational  approximation  and  obtained 
significantly  inferior  accuracy. 
To gain some perspective on the performance of the unsupervised methods we
tested, we also report on the results of a trivial “random clustering”, which simply
places each document in a random cluster. At the other extreme, we report on
the categorization accuracy of a supervised application of a support vector machine
(SVM), applied with a linear kernel and with cross-validated parameter tuning (using
the same setup as described in Bekkerman et al. [11]). We stress that the supervised
categorization accuracy cannot be directly compared with the clustering accuracy;
however, it provides some perspective on the datasets’ “complexities”.

4.6.4  Implementation  details 

The  following  technical  details  are  important  for  replicating  our  experimental 
results: 

1.  Unless  stated  otherwise,  we  use  the  bottom-up  scheme  for  documents  and  the 
top-down  scheme  for  all  the  other  clusterings. 

2. As discussed in Section 4.3, we merge each document cluster with its closest
peer. Following Slonim & Tishby [98], we choose the Jensen-Shannon divergence
between clusters as the underlying “metric”.

3. At the MDC’s last iteration (at which the required number of document
clusters is obtained), we apply the optimization routine after merging each pair of
clusters.

4.  We  perform  10  random  restarts  at  each  iteration  of  MDC.  For  a  fair  comparison, 
we  perform  the  same  number  of  random  restarts  in  our  implementations  of  both 
sIB  and  ITCC  algorithms. 

5. We use the same clustering schedule $S_l$ for every dataset. The schedule starts
with splits of top-down clusters: as discussed in Section 4.3, it cannot start with
a merger of document clusters, otherwise the objective function (4.3) would be
0. Also, we notice that it is not beneficial to start merging document clusters
before a significant number of word clusters is obtained, otherwise the objective
function is still too close to 0. Thus, before doing the first iteration over
documents, we perform four iterations over words, and then continue with a plain
(non-weighted) round-robin.
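
The closest-peer selection of item 2 can be illustrated with the following Python sketch. It assumes that each document cluster is summarized by a probability distribution (for instance, over word clusters) and exposes the mixture weights of the Jensen-Shannon divergence as parameters, since the weighting is not specified here. The names `js_divergence` and `closest_peer` are hypothetical.

```python
import math
from typing import List, Sequence


def kl_divergence(p: Sequence[float], q: Sequence[float]) -> float:
    """KL(p || q) in nats; terms with p_i = 0 contribute nothing."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


def js_divergence(p: Sequence[float], q: Sequence[float],
                  w_p: float = 0.5, w_q: float = 0.5) -> float:
    """Jensen-Shannon divergence with positive mixture weights w_p + w_q = 1."""
    mixture = [w_p * pi + w_q * qi for pi, qi in zip(p, q)]
    return w_p * kl_divergence(p, mixture) + w_q * kl_divergence(q, mixture)


def closest_peer(index: int, distributions: List[Sequence[float]]) -> int:
    """Index of the cluster whose distribution is closest to cluster `index`."""
    others = [j for j in range(len(distributions)) if j != index]
    return min(others,
               key=lambda j: js_divergence(distributions[index], distributions[j]))
```

With equal weights the divergence is symmetric, zero for identical distributions, and at most log 2 (reached for distributions with disjoint supports).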

4.6.5  Comparative  results 

Micro-averaged accuracy (averaged over ten independent runs,11 whenever
applicable) for the six datasets is reported in Table 4.2. It is evident that the results of our
bi-modal Comraf clustering (with the underlying MDC algorithm) are significantly
superior to those obtained by other methods. The only statistically insignificant
improvement is recorded for MDC over sequential IB on the CALO:ACHEYER dataset;
all the other gaps are statistically significant. Of particular importance is the striking
69.5% micro-averaged accuracy achieved by the bi-modal MDC on 20NG.12 This
impressive result is 12% (absolute) higher than the best previously reported result on this
dataset. Specifically, a micro-averaged accuracy of 57.5% on 20NG is reported for sIB in [96].
This result is obtained with only 2,000 “most discriminating” words. Also, in that
work, duplicated and small documents are removed, leaving only 17,446 documents.
In our implementation of sIB, our use of almost 40,000 words leads to 61% accuracy
on the entire dataset of 19,997 documents. More than 5% absolute improvement is
also obtained on Enron:KITCHEN-L and CALO:MGONDEK datasets.

11Randomized algorithms, such as MDC, may obtain different results each time they are applied
to  the  same  dataset.  We  perform  ten  independent  runs  of  each  randomized  algorithm  on  the  same 
data,  and  compute  the  mean  of  the  obtained  results,  as  well  as  the  standard  error  of  the  mean. 

12In  [10]  we  reported  on  a  slightly  better  result  of  MDC  on  20NG.  This  better  performance  was 
obtained  using  a  cluster  balancing  heuristic  that  reduced  the  probability  of  small  clusters  to  be 
further  split  and  of  large  clusters  to  be  further  merged.  Later  we  discovered  that  this  heuristic  is 
not  uniformly  effective  across  datasets  and  we  therefore  abandoned  it. 
Method                        CALO:acheyer  CALO:mgervasio  CALO:mgondek  Enron:kitchen-l  Enron:sanders-r  20NG
Random                        17.8 ± 0.5    18.3 ± 0.3      32.4 ± 0.1    17.9 ± 0.1       35.4 ± 0.1       6.3 ± 0.1
K-means                       24.7          24.1            37.0          29.6             45.5             OOM
Agglom. IB                    36.4          30.9            43.3          31.0             48.8             26.5
Sequent. IB                   47.0 ± 0.5    35.1 ± 0.6      68.2 ± 1.2    34.6 ± 0.5       63.1 ± 0.6       61.0 ± 0.7
ITCC                          46.1 ± 0.3    34.2 ± 0.5      63.4 ± 1.1    31.8 ± 0.2       60.2 ± 0.4       57.7 ± 0.2
LDA                           44.3 ± 0.4    38.5 ± 0.4      68.0 ± 0.8    36.7 ± 0.3       63.8 ± 0.4       56.7 ± 0.6
2-modal Comraf (sequential)   47.8 ± 0.4    42.4 ± 0.4      75.9 ± 0.6    42.4 ± 0.6       67.4 ± 0.3       69.5 ± 0.7
2-modal Comraf (shuffled)     47.1 ± 0.4    44.0 ± 1.0      75.5 ± 0.5    41.6 ± 0.8       67.6 ± 0.3       67.2 ± 0.8
SVM (supervised)              65.8 ± 2.9    77.6 ± 1.0      92.6 ± 0.8    73.1 ± 1.2       87.6 ± 1.0       91.3 ± 0.3

Table 4.2. Micro-averaged accuracy (± standard error of the mean, when
applicable) on the six datasets. The SVM supervised classification accuracies are obtained
with 4-fold cross validation. “OOM” means “out of memory”: WEKA was unable to
cluster 20NG on a 4GB RAM machine. Bold numbers are the best results over all.


Surprisingly, on CALO and Enron datasets, the sequential version of MDC and
its shuffled version obtain almost identical results (the difference is statistically
insignificant). Note that in both versions we perform the same number of optimization
steps.  However,  on  20NG,  sequential  MDC  is  significantly  superior.  This  can  be 
explained  by  the  fact  that  sequential  MDC  is  guaranteed  to  iterate  over  all  the  data 
instances,  while  shuffled  MDC  is  not.  On  smaller  datasets  (CALO  and  Enron),  the 
number  of  optimization  steps  is  large  enough  to  make  the  shuffled  version  iterate  over 
(almost)  every  data  instance.  On  a  larger  dataset  (20NG),  however,  shuffled  MDC 
is  less  likely  to  iterate  over  every  data  instance,  and  therefore  is  sub-optimal. 

Table 4.3 shows macro-averaged accuracy results on CALO and Enron datasets.
Compared with micro-averaged accuracy, macro-averaged accuracy favors smaller
clusters over larger clusters. We can see in the table that Comraf’s sequential MDC
method is still significantly better than the baselines (here we show the results of
only the three most prominent baselines: sIB, ITCC and LDA). The only exception
is an insignificant improvement MDC achieves over sIB on the Enron:KITCHEN-L
dataset. Note that the macro-averaged accuracies shown in Table 4.3 are in most
cases higher than micro-averaged accuracies (Table 4.2). This implies that small
clusters constructed by the discussed clustering methods are generally cleaner than
large clusters.

Method                        CALO:acheyer  CALO:mgervasio  CALO:mgondek  Enron:kitchen-l  Enron:sanders-r
Sequent. IB                   57.4 ± 0.7    53.1 ± 0.7      65.9 ± 0.6    46.7 ± 0.4       69.2 ± 0.7
ITCC                          57.3 ± 0.4    50.0 ± 1.2      67.0 ± 0.8    43.3 ± 0.4       65.6 ± 0.4
LDA                           53.0 ± 0.6    52.2 ± 0.8      63.6 ± 0.7    39.1 ± 0.2       66.7 ± 0.2
2-modal Comraf (sequential)   59.9 ± 0.5    58.5 ± 0.7      76.9 ± 0.8    47.0 ± 0.7       74.6 ± 1.1

Table 4.3. Macro-averaged accuracy (± standard error of the mean) on CALO
and Enron datasets. Each number is an average over ten independent runs. Bold
numbers are the best results over all.

As shown in Table 4.4, our tri-modal Comraf (documents/words/correspondents)
consistently improves the bi-modal Comraf performance on the CALO email datasets.
On MGERVASIO, the addition of the correspondents’ modality leads to an impressive
absolute improvement of 10%. On Enron email, however, tri-modal Comraf shows
mixed results: a significant improvement on SANDERS-R and a drop on KITCHEN-L.
A closer inspection reveals that the email correspondent input stream in Enron
datasets is extremely noisy. That is, the information on the same person can be
represented in dozens of different formats, delimiters between separate records are
sometimes non-existent, and many email messages have very long lists of recipients
(which probably implies that email data do not always correlate strongly with the
recipient data).

When comparing the ICM and CWO optimization methods for Comrafs (see
Section 3.3), we can see that ICM usually outperforms CWO. However, CWO is a
somewhat simpler and significantly faster method (for a discussion, see Section 4.4).

Our experimentation with 4-modal Comraf (documents/words/correspondents/
subject lines) on the CALO datasets shows further (insignificant) improvement over
the tri-modal Comraf performance. On Enron, in contrast, a significant drop can be
observed. An important observation is that the subject line modality is substantially
sparser than other modalities in the Enron datasets. The addition of
a sparse modality thus appears to be non-beneficial for multi-modal clustering. A formal
method for learning a Comraf model structure is emerging, which we leave for our
future work (for a discussion, see Section 4.6.6 below).

Method                   CALO:acheyer  CALO:mgervasio  CALO:mgondek  Enron:kitchen-l  Enron:sanders-r
2-modal Comraf (ICM)     47.8 ± 0.4    42.4 ± 0.4      75.9 ± 0.6    42.4 ± 0.6       67.4 ± 0.3
3-modal Comraf (ICM)     49.1 ± 0.4    52.4 ± 0.7      80.1 ± 0.7    40.2 ± 0.3       69.0 ± 0.4
3-modal Comraf (CWO)     47.2 ± 0.3    48.4 ± 0.5      76.1 ± 1.2    39.5 ± 0.5       63.9 ± 0.2
4-modal Comraf (ICM)     50.2 ± 0.6    54.1 ± 0.5      80.9 ± 0.5    34.2 ± 0.2       63.1 ± 0.4
4-modal Comraf (CWO)     47.6 ± 0.2    48.6 ± 0.6      78.7 ± 1.1    38.7 ± 0.4       63.4 ± 0.4

Table 4.4. Micro-averaged accuracy (± standard error of the mean) on CALO
and Enron datasets. Each number is an average over ten independent runs. Comraf
models are 2-modal, 3-modal and 4-modal, with the sequential optimization applied
at each node. Bold numbers are the best results over all.

4.6.5.1 Experimentation with clustering schedule

On  CALO  data,  we  test  another  algorithmic  setup  of  the  bi-modal  MDC,  in  which 
both  words  and  documents  are  clustered  bottom-up.  The  results  are  very  similar  to 
our  original  bi-modal  MDC  accuracies.  However,  this  setting  is  not  applicable  to 
larger  datasets:  taking  constants  into  account,  on  the  20NG  dataset  the  bottom-up 
version  of  MDC  would  be  300  times  slower  than  the  original  (top-down  /  bottom-up) 
MDC. 

In addition, we test a reverse clustering schedule, where we apply bottom-up
clustering to words and top-down clustering to documents. On the 20NG dataset,
we perform five splitting iterations over documents (obtaining 32 clusters) and then
apply the last exhaustive clustering iteration as explained in Section 4.6.4,
reducing the number of clusters to 20. The micro-averaged clustering accuracy obtained
by the reverse schedule is 69.3 ± 0.4%, which is statistically indistinguishable from
the original MDC’s performance. Note that the reverse scheme is significantly
faster than the original MDC (8 clustering iterations vs. 21 iterations on 20NG). On
email datasets, similar results are obtained in three of the five cases, whereas in the two
others the reverse schedule shows significantly poorer performance (3% decrease on
CALO:mgervasio and 7% decrease on Enron:kitchen-l).

[Figure: accuracy plots per dataset (panel titles recovered: CALO:acheyer, CALO:mgervasio); x-axis: search length]

Figure 4.4. Clustering accuracies as a function of the length of local search in
sequential MDC: ‘0.5’ on the x-axis means that the MDC’s optimization routine was
executed over one half of the data points (chosen uniformly at random), while ‘3’
means that the optimization routine was executed over every data point 3 times. All
our results are averaged over 10 independent runs.

4.6.5.2 Experimentation with the length of local search

Figure 4.4 presents the micro-averaged clustering accuracy of sequential MDC
(in a bi-modal Comraf) as a function of the length of local search performed in the
lattices of all possible word and document clusterings. Recall that in Algorithm 2
we perform a local search (i.e. an optimization phase), in which every data point is
sequentially pulled out of its cluster and assigned to a cluster such that the objective
function is maximized. In their sequential IB algorithm, Slonim et al. [96] propose
to execute such an optimization routine a number of times, up to the convergence
of the objective function to its local maximum. Their approach has the drawback
of a potentially unlimited execution time: while it is guaranteed that the objective
function eventually converges, it is uncertain how long this can take.

In  our  MDC’s  implementation,  we  perform  the  optimization  routine  twice  (see 
Section  4.3),  in  order  to  approach  the  local  maximum,  while  not  setting  our  stopping 
criterion  at  achieving  the  full  convergence.  In  this  section,  we  ask  the  question  whether 
or  not  the  length  of  the  local  search  is  a  crucial  parameter  of  our  system.  Our 
experiment  is  conducted  as  follows:  in  a  bi-modal  Comraf,  we  set  the  length  of 
the  optimization  routine  to  be  a  function  of  the  number  of  data  points  (words  or 
documents).  We  start  with  the  case  where  we  explore  only  one  quarter  of  the  data 
(chosen  uniformly  at  random),  then  we  try  one  half,  and  then  we  perform  from  1  to  4 
full  passes  (over  all  the  data  points).  We  perform  this  experiment  on  4  email  datasets 
(excluding  the  large  kitchen-l  and  20NG  collections). 
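
The notion of search length used in this experiment can be made concrete with a short Python sketch. It assumes that the integer part of the length gives the number of full passes and that a fractional part selects that fraction of the data points uniformly at random, without repetition. The function name `visit_order` and the rounding of the fractional part are hypothetical choices for this illustration.

```python
import random
from typing import List


def visit_order(n_points: int, search_length: float, rng: random.Random) -> List[int]:
    """Indices of the data points visited by a local search of the given length.

    A length of 3 visits every point three times; a length of 0.5 visits one
    half of the points, chosen uniformly at random.
    """
    full_passes = int(search_length)
    remainder = search_length - full_passes
    order: List[int] = []
    for _ in range(full_passes):
        order.extend(range(n_points))  # one sequential pass over all the data
    if remainder > 0:
        n_extra = round(remainder * n_points)
        order.extend(rng.sample(range(n_points), n_extra))
    return order
```

Each index in the returned list corresponds to one pull-and-reassign step of the optimization routine.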

As can be seen in Figure 4.4, the correlation between local search length and
clustering accuracy is quite weak, as soon as at least one pass over all the data is performed.
In some cases, shorter searches are quite effective (such as the one on mgondek),
while in others (sanders-r) a significant drop is recorded. Searches longer than
two data sizes are generally not beneficial: while a (rather insignificant) improvement
can usually be seen, it does not justify the increase in run time.

Finally, let us emphasize that we approximate a local maximum of our objective
function. Following Slonim [95], we note that obtaining a global maximum is very
unlikely in our non-convex combinatorial optimization environment, where (in the
worst case) all possible configurations should be tested in order to achieve a global
optimum. Since the number of possible configurations is astronomical even in the
smallest real-world setups, approximating a global maximum is practically impossible.

[Figure: Clustering accuracy on various Pairwise Interaction Graphs]

Figure 4.5. Experimenting with various Comraf graphs on MGERVASIO.

4.6.6  Model  analysis 

As  shown  in  Section  4.6,  multi-modal  clustering  based  on  more  than  two  entities 
may  or  may  not  improve  performance  relative  to  the  bi-modal  clustering.  Given  a 
Comraf  graph  (and  the  corresponding  pairwise  data),  an  interesting  question  is  which 
of  the  pairwise  interactions  can  contribute  useful  information  to  clustering  the  target 
variable. 

We investigate this problem with respect to the MGERVASIO dataset. Specifically,
we test all possible Comraf graphs and measure their effectiveness in accurately
clustering the target variable. Figure 4.5 summarizes our findings (for better
visibility, we present only the most interesting cases). As can be seen in the
figure, the choice of a Comraf graph can dramatically affect the clustering
performance (within a 15% accuracy range). This experiment also illustrates that
model learning is feasible in Comrafs (which usually contain a small number of
nodes).

Some variables can be crucial for obtaining good clustering results, while others
can be unnecessary or even harmful. For example, when substituting the words
variable with email subjects, a decrease in the results can always be seen
(naturally, email bodies provide more information than email subjects). In
contrast, the correspondents variable plays a positive role in foldering email of
MGERVASIO. Somewhat surprisingly, the bi-modal documents/correspondents clustering
setup leads to a 6% absolute improvement over the ordinary documents/words setup.
A possible explanation is that most of the folders in this dataset are created
according to groups of people in the email owner’s social network.

Some interactions are more important than others. For example, in the
documents/correspondents/titles triangle, a missing documents/correspondents
interaction can cause a 10% drop in accuracy. However, once the crucial
interactions are selected, adding other interactions does not significantly affect
the performance, but rather adds a certain computational burden. Therefore, a
desirable goal would be to select only crucial interactions, which are the ones
presented in Figure 4.3. When using the CWO inference method, however,
constructing the full Comraf graph is sometimes beneficial. For example, in a
tri-modal setting, the addition of the correspondents/words interaction leads to a
significant improvement in the (micro-averaged) document clustering accuracy on
three of the five datasets: 51.1 ± 0.4% vs. 48.4 ± 0.5% on mgervasio,
42.2 ± 0.4% vs. 39.5 ± 0.5% on kitchen-l, and 68.8 ± 0.2% vs. 63.9 ± 0.2% on
sanders-r.

4.6.7  Multi-modal  clustering  for  social  network  analysis

Multi-modal clustering as presented in this chapter can be applied not only to
document clustering, but also to word clustering or to clustering of people for
the purposes of social network analysis. We apply our tri-modal Comraf to
simultaneously cluster email messages, their words and correspondents, and
evaluate the quality of the constructed clusters of email correspondents. To
obtain the ground truth data, we asked Dr. Melinda Gervasio, the creator of the
CALO:MGERVASIO email directory, to classify her 61 correspondents into semantic
groups. She created four categories: SRI management, SRI CALO collaborators,
non-SRI CALO participants and other SRI people not involved in the CALO project.

We evaluate two clusterings: one constrained to produce four clusters, the other
to produce eight. Both results are highly correlated with Melinda Gervasio’s
labelings. In our four-cluster setup, the category of SRI management is united
with the category of non-SRI people, while the category of SRI CALO collaborators
(the largest one) is split into two clusters. The fourth category (other SRI
people) forms a single clean cluster, and the borders between the categories are
successfully identified, leading to 62.3 ± 1.4% accuracy averaged over four
independent runs.

In the eight-cluster result, the categories of SRI management and non-SRI people
are almost perfectly split into two different clusters, while other SRI employees
still form one cluster, and the category of SRI CALO participants is now
distributed over five clusters, one of which contains only one person, Melinda
Gervasio herself. The overall precision of the eight-cluster system is as high as
76.6 ± 2.8%.

4.7  Experimentation:  Web  appearance  disambiguation 

In this section, we illustrate the application of Comraf clustering to a
real-world task. In [13] we introduced Web appearance disambiguation (WAD) as the
problem of inferring a model $\mathcal{M}$ that provides a binary function
$f(d, h, \mathcal{K})$ answering whether or not a Web page d refers to a
particular person h, given the background knowledge $\mathcal{K}$. For simplicity,
we consider only the case when h’s name is explicitly mentioned in the page d. The
problem might be easy when h’s name is unique, but becomes difficult when h has a
common name, such as “Tom Mitchell”. Moreover, we do not know a priori whether a
given person h has a unique name or not.

Note that the WAD problem is similar to, but not a special case of, the problem
of person name disambiguation. In person name disambiguation, given a collection
of documents all of which mention a person name, the goal is to distinguish between
documents that mention different people who have this name. In WAD, in contrast,
the goal is to find the subset of the document collection in which the person of
interest is mentioned, while filtering out documents that mention unrelated
namesakes. In our opinion, the WAD setup is more realistic than person name
disambiguation in the context of Web search, where one is usually interested in
finding information about a particular person, rather than about all people with
the same name.

As perfect background knowledge $\mathcal{K}$ is in most cases unavailable, the
disambiguation decision must be made using some limited available information.
Note that given no background knowledge at all, the WAD problem becomes
ill-defined: in order to automatically perform the task, the person h must have an
electronic representation, which cannot be constructed without any prior knowledge
about the person. If $\mathcal{K}$ includes training data (pages that are related
or unrelated to the person), the WAD problem is reduced to a binary classification
task. In this thesis, however, we consider an unsupervised scenario.

We notice that as soon as we are given not just one, but at least two names of
people who are known to belong to one social network, the WAD problem becomes
well-defined and solvable. An example is “Tom Mitchell” and “William Cohen”.
Since William Cohen’s name appears in conjunction with Tom Mitchell’s, it is
apparent that we refer to William Cohen the CMU Professor, and not to the former
US Secretary of Defense. It is rare that two people in one social network have
two namesakes in another. However, the probability of such a collision is not
zero. We can minimize this probability by considering N > 2 names. To summarize,
our background knowledge $\mathcal{K}$ is a list of names of people who are
believed to belong to h’s social network.

In a recent followup paper [113], Yang et al. claim that obtaining a few names
of people who belong to the same social network is very hard. However, this is
usually not the case. In many real-world cases a person name appears in the
context of other people’s names. These can be co-authors of a scientific paper,
recipients of the same email message, attendees of a meeting or a conference, etc.
It is important to note that two people can belong to the same social network
without even knowing each other. For instance, given two randomly chosen names of
machine learning researchers $h_1$ and $h_2$, who may or may not be acquaintances,
the disambiguation task is nevertheless likely to be solved, as Web pages
referring to $h_1$ and $h_2$ are likely to be close in content, or close in the
Web graph (the graph of hyperlinks).

In this section, we address the WAD problem as a clustering task in a Comraf.
For each person h (out of a list of N people from one social network), we retrieve
$n_h$ documents that mention h’s name. The resulting collection of $N \cdot n_h$
documents is clustered using the MDC method in a Comraf. For this task, our Comraf
model is very simple (see Figure 4.3 left): we simultaneously cluster documents
and their words.

Out of the k document clusters constructed, we choose one cluster to be the
subset of documents that mention people of interest, and we delete all the other
clusters that potentially mention unrelated namesakes. Our criterion for choosing
the “relevant” cluster is the level of interconnectedness of documents in the
cluster: for each document $d_i$ we construct a set $\mathcal{L}_i$ of its
hyperlinks (see Section 4.7.4 for the precise definition of $\mathcal{L}_i$); for
each document cluster $c_j$ we construct a set
$\mathcal{LC}_j = \bigcup_{(d_i, d_{i'}) \in c_j} (\mathcal{L}_i \cap \mathcal{L}_{i'})$,
i.e. the union of pairwise intersections of hyperlink sets; finally we choose the
cluster $c_j$ with the largest set $\mathcal{LC}_j$. In Section 6.6.1 we propose
another, possibly more adequate solution to the WAD problem.
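
The selection criterion can be sketched in a few lines. This is only an
illustration under our own naming (`interconnectedness`, `choose_relevant_cluster`);
hyperlink sets are assumed to be given as a dictionary from document identifiers to
Python sets.

```python
from itertools import combinations


def interconnectedness(cluster, link_sets):
    """Union of pairwise intersections of the hyperlink sets of one cluster."""
    shared = set()
    for a, b in combinations(cluster, 2):
        shared |= link_sets[a] & link_sets[b]   # links common to documents a and b
    return shared


def choose_relevant_cluster(clusters, link_sets):
    """Index of the cluster whose set of shared hyperlinks is the largest."""
    return max(range(len(clusters)),
               key=lambda j: len(interconnectedness(clusters[j], link_sets)))
```

Note that a cluster containing a single document has no pairs and therefore an
empty set of shared hyperlinks, so it is never preferred over a cluster with at
least one shared link.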

4.7.1  Related  work 

Prior to our paper [13] where the WAD framework was introduced, only a handful
of papers addressed the problem of person name disambiguation. Some work was
done on person name disambiguation in a collection of scientific papers [51]. In
the Web domain, we are aware of three related works [4, 74, 43], within the
general framework of entity coreference (see, e.g. [83, 49]). Agglomerative
clustering is applied in all three. Bagga and Baldwin [4] use agglomerative
clustering over traditional vector space models of text windows around a personal
name mention. Mann and Yarowsky [74] propose a richer document representation
involving automatically extracted features. Their clustering technique, however,
can essentially be used only for separating two people with the same name.
Fleischman and Hovy [43] construct a MaxEnt classifier to learn distances between
documents that are then clustered. This method needs to be provided with a large
training set. Since 2005, many followup papers have been published, see
[76, 111, 113] and about 30 others.

4.7.2  Evaluation  criterion 

To define our evaluation criterion, let c be the constructed cluster of documents
that we believe refer to people of our interest, and let $c_r$ be its portion
consisting of documents that actually refer to people of our interest. Let
$\mathcal{D}_r$ be the portion of the dataset $\mathcal{D}$ that consists of
documents referring to people of our interest. Precision of the cluster c is then
defined as $Prec = |c_r|/|c|$, recall as $Rec = |c_r|/|\mathcal{D}_r|$, and
F-measure, as is standard, as $(2 \cdot Prec \cdot Rec)/(Prec + Rec)$.
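
As a worked illustration of these definitions, the following sketch computes the
three scores from sets of document identifiers. The function name `wad_scores` and
the convention of returning 0 for empty denominators are our assumptions.

```python
def wad_scores(cluster, relevant):
    """Precision, recall and F-measure of a chosen cluster.

    cluster  : documents placed in the constructed cluster c
    relevant : documents of the dataset that actually refer to people of interest
    """
    c, d_r = set(cluster), set(relevant)
    c_r = c & d_r                                   # relevant documents inside c
    prec = len(c_r) / len(c) if c else 0.0
    rec = len(c_r) / len(d_r) if d_r else 0.0
    f = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
    return prec, rec, f
```

For example, a cluster of four documents of which two are relevant, in a dataset
with six relevant documents, has precision 1/2, recall 1/3 and F-measure 0.4.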




4.7.3  WAD  dataset 


For  evaluation  of  our  methods,  we  gathered  and  labeled  a  dataset  of  1085  Web 
pages.  In  this  section  we  describe  the  dataset  and  provide  some  interesting  insights 
into  its  structure. 

From  the  Feb  2,  2004  snapshot  of  the  CALO  email  data  (see  Section  4.6.2),  we 
selected  one  folder  from  Dr.  Melinda  Gervasio’s  email  directory  and  extracted  12 
person  names  that  appeared  in  headers  of  messages  found  in  this  folder.  The  names 
are  primarily  of  SRI  employees  and  CS  professors  from  various  universities.  All  of 
the  individuals  are  likely  to  be  present  on  the  Web. 

In  May  2004,  these  12  names  (in  quotation  marks,  i.e.  treated  as  phrases)  were 
issued  as  queries  to  Google  and  for  each  query  the  first  100  pages  were  retrieved.  We 
manually  filtered  the  pages,  removing  pages  in  non-textual  formats,  HTTPD  error 
pages  and  empty  pages.  We  labeled  the  remaining  pages  by  the  occupation  of  the 
individuals  whose  name  appeared  in  the  query.  In  10  out  of  12  cases,  the  names  were 
heavily  ambiguous,  thus  pages  representing  187  different  people  were  retrieved  given 
the  12  names  of  people  in  Melinda  Gervasio’s  social  network.  In  some  cases,  it  was 
difficult  to  decide  to  which  of  the  namesakes  the  page  referred.  To  determine  this,  we 
often  performed  manual  Web  investigations.  Table  4.5  shows  some  statistics  of  the 
dataset. 

Finally, all the pages were cleaned of their HTML markup and scripts. All the
URLs mentioned in the pages were extracted and placed at the end of each page,
together with the URL of the page itself. The dataset is publicly available at
http://www.cs.umass.edu/~ronb/name_disambiguation.html.

The most ambiguous personal name among the twelve is Tom Mitchell. Although
the CMU Professor’s pages are prevalent over all the others, 37 different Tom
Mitchells can be distinguished in the first 100 Google hits, including professors
in different fields, musicians, executive managers, an astrologist, a hacker and a
rabbi. Two personal names out of the 12, Adam Cheyer and Leslie Pack Kaelbling,
seem to be unique on the Internet. However, for each of them, one page was
retrieved that did not contain any part of their names. These two pages were put
into the respective categories other. Two other people, David Mulford and Lynn
Voss, seem to have very little Web presence: for each of them, only one page out
of the 100 was relevant. William Cohen’s and David Mulford’s namesakes are
well-known politicians: the former US Secretary of Defense William S. Cohen and
the current US Ambassador to India David C. Mulford. Naturally, the distributions
of Cohen’s and Mulford’s pages are heavily biased toward the politicians, who are
well represented on the Web.

Person name             Position             Number of pages   Number of categories   Number of relevant pages
Adam Cheyer             SRI Manager          97                2                      96
William Cohen           CMU Professor        88                10                     6
Steve Hardt             SRI Engineer         81                6                      64
David Israel            SRI Manager          92                19                     20
Leslie Pack Kaelbling   MIT Professor        89                2                      88
Bill Mark               SRI Manager          94                8                      11
Andrew McCallum         UMass Professor      94                16                     54
Tom Mitchell            CMU Professor        92                37                     15
David Mulford           Stanford Undergrad   94                13                     1
Andrew Ng               Stanford Professor   87                29                     32
Fernando Pereira        UPenn Professor      88                19                     32
Lynn Voss               SRI Engineer         89                26                     1
OVERALL:                                     1085              187                    420

Table 4.5. Statistics of the WAD dataset. Categories are the different namesakes,
or other if the page does not refer to any of the namesakes. The last column
shows the number of pages that actually mention the person of our interest.

An interesting phenomenon is observed for the names David Israel and Bill Mark.
Many of the pages that responded to these queries only accidentally contain the
two words adjacent to each other: Bill Mark’s pages often refer to mark-ups of
certain bills, or just list people’s first names (e.g. “Thanks Bill, Mark!”),
while some of David Israel’s pages discuss Israeli history and King David. None of
these pages were removed from the dataset, despite the fact that they are clearly
unrelated to a particular living person.

A major challenge for the WAD system is the pages of Bill Mark and Fernando
Pereira. Both researchers have namesakes who are also researchers in Computer
Science: another Bill Mark is a UTexas Professor, while another Fernando Pereira is
a Professor at Instituto Superior Tecnico in Portugal. We term these pairs
“doubles”. Separating them is an especially difficult task. The opposite problem
occurs with Steve Hardt: he appears on the Web not only as an SRI engineer, but
also as the creator of an online game. We ourselves are actually unsure whether
this is one person or two different people, but most likely this is one person.

4.7.4  Baseline:  link  structure  model 

As our baseline, we propose a one-class clustering method based on link structure
analysis of Web pages (see [13] for some additional details).¹³ Let graph
$G_{LS} = (\mathcal{D}, \mathcal{HC})$ be the Link Structure Graph over a set of
Web pages $\mathcal{D}$, where $\mathcal{HC}$ is a set of hyperlink connections
between Web pages in $\mathcal{D}$. We say that two Web pages $d_i$ and $d_{i'}$
have a hyperlink connection if the sets of their hyperlinks, $\mathcal{L}_i$ and
$\mathcal{L}_{i'}$, have a non-empty intersection:
$\mathcal{L}_i \cap \mathcal{L}_{i'} \neq \emptyset$. Let us now define the set of
hyperlinks.

For a Web page d, we define a function URL(d) to be the domain of d’s URL with
its first directory, if this directory exists. For example, given page $d_1$ with
URL http://www.cs.umass.edu/~ronb/timeline.html the function URL($d_1$) will
return www.cs.umass.edu/~ronb. Given page $d_2$ with URL http://www.cs.umass.edu/
the function URL($d_2$) will return www.cs.umass.edu. By this, we capture the
intuition that full URLs can be too specific, while URLs’ domains can be too
general.
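
A possible rendering of this generalization in code is given below. It is a sketch
under one assumption that the text does not settle: a trailing path component not
followed by a slash is treated as a file name rather than a directory. The name
`url_key` is ours.

```python
from urllib.parse import urlparse


def url_key(url):
    """Domain of the URL plus its first directory, if such a directory exists."""
    parsed = urlparse(url)
    path = parsed.path
    if not path.endswith("/"):
        # Assumption: the last component is a file name, so drop it.
        path = path.rsplit("/", 1)[0]
    dirs = [segment for segment in path.split("/") if segment]
    return parsed.netloc + ("/" + dirs[0] if dirs else "")


# The two examples from the text:
# url_key("http://www.cs.umass.edu/~ronb/timeline.html") -> "www.cs.umass.edu/~ronb"
# url_key("http://www.cs.umass.edu/")                    -> "www.cs.umass.edu"
```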

Define a set POP to be a set of URLs with extremely popular domains, such as
www.amazon.com. The popularity of a domain is determined using the link: operator
of Google’s command line. For a Web page d, define a set HOP(d) as the set of Web
pages that can be reached from d by following d’s hyperlinks.

¹³ In [18] we proposed another link analysis method, based on a heuristic search in
the Web graph.

Figure 4.6. Relevant and irrelevant Web pages according to the Link Structure
model. Relevant pages are within the δ-radius from the Core Connected Component.
White, gray and black colors indicate that the pages are retrieved by three
different queries.


Definition 4.7.1 A set of hyperlinks of a Web page $d_i$ is defined as

$$\mathcal{L}_i = \big( URL(d_i) \cup URL(HOP(d_i)) \big) \setminus POP.$$

That is, $\mathcal{L}_i$ is $d_i$’s URL and the URLs that appear in $d_i$, after a
generalization (using the function URL) and removal of URLs with too popular
domains.
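
Definition 4.7.1 translates almost directly into set operations. In the sketch
below, `url_fn` stands for the generalization function URL and is passed in as a
parameter; representing POP by a set of popular domain names is our simplification,
and the name `hyperlink_set` is hypothetical.

```python
def hyperlink_set(page_url, outgoing_urls, popular_domains, url_fn):
    """Generalized URLs of a page and of its outgoing links, minus popular domains."""
    keys = {url_fn(page_url)} | {url_fn(u) for u in outgoing_urls}
    # A generalized URL looks like "domain" or "domain/first_directory".
    return {k for k in keys if k.split("/", 1)[0] not in popular_domains}
```

Two pages then have a hyperlink connection exactly when their sets are not
disjoint, which can be checked with `not a.isdisjoint(b)`.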

The graph $G_{LS}$ consists of a number of connected components. Our task is to
find a Core Connected Component (CCC) of Web pages that mention people of our
interest. We naturally expect Web pages from the CCC to interconnect much more
than non-CCC Web pages would. Of special importance is that CCC pages referring to
different people are likely to interconnect, while non-CCC pages referring to
different people would probably not connect to each other. We could have decided
that the Maximal Connected Component (MCC) of graph $G_{LS}$ would be the core
connected component. However, there can be a case where the MCC consists only of
Web pages retrieved in response to a single query; this can happen when pages of
one person h are heavily interconnected. If this person h turns out to be an
irrelevant namesake, such an MCC will be totally irrelevant. Therefore, we come up
with the following definition:

Definition 4.7.2 Denote Core Connected Component (CCC) $C_0$ as the largest
connected component in $G_{LS}$ that consists of pages retrieved by more than one
query.

Definition 4.7.3 The Link Structure Model $M_{LS}$ is a pair $(CC, \delta)$, where
CC is the set of all connected components of the graph $G_{LS}$ (note that
$C_0 \in CC$), and $\delta$ is a distance threshold.
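
The definitions above can be sketched with a small union-find structure. The code
is illustrative only: it assumes hyperlink sets are already computed, that
`query_of` maps each page to the query that retrieved it, and the function names
are ours.

```python
from collections import defaultdict


def connected_components(link_sets):
    """Components of the graph in which pages sharing a hyperlink are connected."""
    parent = {d: d for d in link_sets}

    def find(d):
        while parent[d] != d:
            parent[d] = parent[parent[d]]   # path halving
            d = parent[d]
        return d

    first_page_with = {}
    for d, links in link_sets.items():
        for link in links:
            if link in first_page_with:
                parent[find(d)] = find(first_page_with[link])   # merge components
            else:
                first_page_with[link] = d

    groups = defaultdict(set)
    for d in link_sets:
        groups[find(d)].add(d)
    return list(groups.values())


def core_connected_component(link_sets, query_of):
    """Largest component whose pages were retrieved by more than one query."""
    multi_query = [c for c in connected_components(link_sets)
                   if len({query_of[d] for d in c}) > 1]
    return max(multi_query, key=len, default=set())
```

The filter on the number of distinct queries is what distinguishes the CCC from
the plain maximal connected component.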

Our  intuition  is  that  the  pages  of  the  CCC  and  of  a  few  connected  components  that 
are  close  to  the  CCC  refer  to  people  of  our  interest,  while  the  others  do  not.  Figure  4.6 
illustrates  this  intuition.  To  find  the  connected  components  that  are  close  to  CCC, 
we  apply  the  popular  cosine  similarity  measure,  while  introducing  a  novel  variation 
of  the  tfidf  term  weighting  function,  that  we  call  Google  tfidf: 


$$\mathrm{Google\_tfidf}(w) = \frac{tf(w)}{\log \mathrm{Google\_df}(w)}, \qquad (4.9)$$

where $\mathrm{Google\_df}(w)$ is the estimated total results count of the term w
if provided as a query to Google. This document frequency count appears to be the
most adequate measurement of the commonness of the term w. The estimated total
results counts of words in our dataset were obtained using Google API.¹⁴
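
Equation (4.9) and the cosine measure can be illustrated as follows. The document
frequency counts are passed in as a plain dictionary; any numbers used with this
sketch are stand-ins, not real Google counts, and skipping words whose count is at
most 1 (where the logarithm would not be positive) is our assumption.

```python
import math
from collections import Counter


def google_tfidf(tokens, google_df):
    """Weight each word by tf(w) / log Google_df(w), as in Equation (4.9)."""
    tf = Counter(tokens)
    return {w: tf[w] / math.log(google_df[w])
            for w in tf if google_df.get(w, 0) > 1}


def cosine(u, v):
    """Cosine similarity between two sparse weight vectors stored as dicts."""
    dot = sum(x * v.get(w, 0.0) for w, x in u.items())
    norm = (math.sqrt(sum(x * x for x in u.values()))
            * math.sqrt(sum(x * x for x in v.values())))
    return dot / norm if norm else 0.0
```

With this weighting, a rare term receives a larger weight than a very common one
that occurs equally often in the text.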

We do not explicitly set the distance threshold δ. Instead, given that in our
dataset (see Section 4.7.3) roughly one third of all Web pages refer to people of
our interest, we set δ such that one third of the pages in the dataset are within
the threshold.¹⁵
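
One plausible reading of this rule is sketched below; the text does not spell out
the procedure, so the details are our assumptions. Components are added in order of
increasing distance from the CCC until the covered pages reach the target fraction,
and δ is taken to be the distance of the last component added.

```python
def select_within_threshold(ccc_size, components, total_pages, fraction=1 / 3):
    """Pick an implicit threshold so that about `fraction` of the pages are covered.

    components : list of (distance_to_ccc, size) pairs for the non-CCC components
    Returns (delta, covered_pages).
    """
    target = fraction * total_pages
    covered, delta = ccc_size, 0.0
    for dist, size in sorted(components):   # closest components first
        if covered >= target:
            break
        covered += size
        delta = dist
    return delta, covered
```

Since whole components are added, the covered fraction can overshoot the target
somewhat.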


¹⁴ http://www.google.com/apis/

¹⁵ As in any unsupervised learning problem, the choice of the desired number of
clusters or, dually, of the cluster sizes, is a problematic issue. We do not
attempt to address this issue here; instead, we fix the size of the desired
cluster based on our domain knowledge.

4.7.5  Comparative  results

Along with our baseline method from Section 4.7.4, we implemented greedy
agglomerative clustering (as applied in the related work [4, 74, 43]), based on
the cosine similarity measure between clusters and the augmented tfidf weighting
function from Equation (4.9). We did not measure interconnectedness of the
clusters; we simply chose the cluster whose F-measure was the highest among all
the clusters. The motivation for this choice was to show that our methods
outperform the best possible results of agglomerative clustering.

Method            Precision     Recall        F-measure
Agglomerative     61.7          53.3          57.2
Link Structure    84.2          71.8          77.5
2-modal Comraf    87.3 ± 1.7    71.3 ± 2.5    78.4 ± 0.9

Table 4.6. Web appearance disambiguation results. Bi-modal Comraf results
are averaged over 4 independent runs, with the standard error of the mean reported
after the ± sign.

The summary of the results is in Table 4.6. As can be seen from the table, both
the link structure and MDC methods significantly outperform agglomerative
clustering, while MDC shows slightly better performance than the link structure
method. The relatively high deviation in precision and recall of the MDC algorithm
is caused by the fact that it never ends up with clusters of exactly the same
size. Interestingly, this deviation hardly affects the F-measure: the precision
trades off quite well against the recall.

Analyzing the results by person, we can see that for quite a few people both
precision and recall are remarkably high, e.g. for David Israel, Leslie Pack
Kaelbling, Andrew McCallum, and Andrew Ng. It is also notable that the only
relevant page of David Mulford (the Stanford student) is found. As could be
anticipated, the worst precision is for Bill Mark and Fernando Pereira, because
both of them have “doubles”. However, only 9 of 23 pages that refer to Bill Mark
the UTexas Professor appear in the category of relevant pages. The worst recall is
for Steve Hardt and Adam Cheyer. This can be easily explained for Steve: most of
his pages refer to an online game he created, and the relevance of these pages
would be too difficult to determine. As for Adam, the low result is a bit
surprising, but it still makes sense: Adam’s name often appears in an industrial
context, while the language of most correctly found pages is purely academic, so
many of Adam’s pages fall too far from the central cluster. Unfortunately, the
single relevant page about Lynn Voss was not found, probably for the same reason:
it uses an industrial vocabulary.

Figure 4.7. Precision/recall curve of the MDC algorithm. Points correspond to
consecutive iterations of the algorithm (merges of Web page clusters).

The problem of disambiguating the “doubles” (the two Bill Marks and two Fernando
Pereiras who all work in Computer Science) can in fact be handled within the
Comraf framework. At some intermediate stages during the course of the MDC
algorithm the most interconnected cluster is relatively small but extremely clean.
Figure 4.7 shows the precision/recall curve for one run of the MDC algorithm. It
can be seen in the graph that when the recall of the relevant cluster is around
45% (there are five clusters overall), the precision is very high (above 98%).¹⁶
This cluster contains two pages of Bill Mark the SRI Manager and none of the pages
of Bill Mark the UTexas Professor; it also contains 15 pages of Fernando Pereira
the UPenn Professor and only one page of Fernando Pereira the Professor of
Instituto Superior Tecnico.

¹⁶ Notably, when the recall is around 15% (17 clusters overall), we obtain 100%
precision.

4.8  Experimentation:  clustering  scientific  papers

In this section, we test the Comraf model on another type of data: a collection of
scientific papers. The goal of this experiment is as follows. From the model
analysis in Section 4.6.6 we can infer that in many cases tree-structured models
perform comparably to loopy models. The question that we ask in this section is
whether there exists a case where a loopy model performs significantly better than
a corresponding tree-structured one. In Section 4.6.6 we provided evidence for the
advantage of loopy models when the underlying inference method is CWO. In this
section, we show the advantage of loopy models when the underlying algorithm is
MDC.

The evidence given in this section has an important implication: as discussed in
Section 4.1, if a Comraf graph is tree-structured, then our objective function
(4.3) is a factorized version of Multi-Information (4.2). That is, Comraf models
of a tree structure are equivalent in their modeling power to the hard version of
multivariate Information Bottleneck (mIB) [97] where the Multi-Information is
used. Loopy Comraf models, however, are not equivalent to mIB. As we show below,
in some cases loopy Comraf models obtain higher results than corresponding
tree-structured ones, which means that in those cases the Comraf framework is
preferable over mIB.

Our dataset was created by David Mimno from a repository of scientific papers
collected for the REXA project.¹⁷ The dataset consists of 4887 conference papers,
published at ten venues: ACL, ICCV, ICRA, IJCAI, KDD, NIPS, SIGIR, SIGMOD,
STOC, and WWW. In our data, a significant number of papers belong to each of
the ten venues: between 224 and 933 papers. From the paper titles, we extracted
1436 words, each of which appeared in at least 2 titles. We also extracted 9841
words from paper abstracts, each of which appeared in at least 2 abstracts.
Citations in the papers were automatically co-referenced using the REXA software
system. Again, as in the case of words, we removed citations that appeared in only
one paper, resulting in 11,143 distinct citations.

¹⁷ http://rexa.info/

Model       (a)            (b)            (c)            (d)            (e)
Accuracy    38.8 ± 0.5%    40.7 ± 0.7%    55.0 ± 0.7%    61.4 ± 0.6%    63.9 ± 0.7%

Table 4.7. Clustering scientific papers. Comraf models for clustering: (a)
documents and title words; (b) documents and citations; (c) documents, title words
and citations in a tree-structured model; (d) documents, title words and citations
in a loopy model; (e) documents and abstract words. The bottom line is the
micro-averaged clustering accuracy obtained by those models.

Our goal is to cluster documents by their venues. We consider five Comraf models
presented in Table 4.7. First, we test two bi-modal Comrafs, where documents D are
clustered with their title words $W_t$ and with their citations C. Second, we
experiment with two tri-modal Comrafs (tree-structured and loopy), where D, $W_t$
and C are clustered simultaneously. Finally, we present a bi-modal Comraf for
clustering documents D and abstract words $W_a$.

Our  underlying  clustering  method  is  a  sequential  MDC  (see  Section  4.3).  We 
cluster  words  and  citations  top-down,  while  clustering  documents  bottom-up.  Our 
clustering  schedule  is  a  plain  round-robin.  The  algorithm  stops  when  the  desired 
number  of  document  clusters  (i.e.  10,  which  equals  the  number  of  venues)  is  reached. 

The micro-averaged clustering accuracy results are presented in the bottom row of
Table 4.7. As can be seen, neither title words nor citations alone are good document
representations: a bi-modal Comraf using either of them obtains only about 40%
accuracy. However, the result of a tree-structured tri-modal Comraf (where documents
are clustered simultaneously with title words and citations) is about 15% higher
(on the absolute scale). Of particular importance is that adding the title
words/citations interaction improves this result by another 6% accuracy (on the
absolute scale). Since a loopy Comraf model like this is not equivalent to any model
in the multivariate IB framework, this result demonstrates the superiority of the
Comraf modeling framework over multivariate IB.18
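
For reference, the accuracy measure used in Table 4.7 can be sketched in a few lines of Python. This is an illustration only: it assumes that micro-averaged clustering accuracy (defined in Section 4.6.1, which is not repeated here) counts the items that carry the dominant ground-truth label of their cluster, and the function name is hypothetical.

```python
from collections import Counter, defaultdict


def micro_averaged_accuracy(cluster_ids, true_labels):
    """Fraction of items that carry the dominant true label of their cluster.

    Assumed definition: every cluster is credited with the number of items
    belonging to its most frequent ground-truth label.
    """
    if not true_labels or len(cluster_ids) != len(true_labels):
        raise ValueError("need two non-empty sequences of equal length")
    labels_per_cluster = defaultdict(Counter)
    for cluster, label in zip(cluster_ids, true_labels):
        labels_per_cluster[cluster][label] += 1
    # Sum the sizes of the dominant label in every cluster.
    dominant_total = sum(max(c.values()) for c in labels_per_cluster.values())
    return dominant_total / len(true_labels)


toy_accuracy = micro_averaged_accuracy(
    [0, 0, 0, 1, 1, 1], ["a", "a", "b", "b", "b", "a"]
)
```

In the toy call, each of the two clusters contains two items of its dominant label, so four of the six items are counted as correct.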

The Comraf model that achieves the best performance on the scientific paper
clustering task is the one where papers are represented over words in their abstracts
(see the last column in Table 4.7). Adding another modality to this setup (such as
citations) causes a significant drop in the clustering accuracy. This result implies that
the abstract word modality is dense enough and much less noisy than title words or
citations. Whenever abstracts are available, using them is preferable to using the
other modalities. However, if abstracts are unavailable, we show that using a
combination of two noisy modalities, such as title words and citations, leads to
almost the same result (61.4% vs. 63.9% accuracy).

4.9  Experimentation:  clustering  documents  by  genre 

So  far,  we  have  considered  clustering  documents  by  their  topic.  Topics,  however, 
are  not  the  only  way  in  which  someone  might  want  to  select  groups  of  documents. 
Aspects  such  as  genre,  opinion,  authorship,  style,  author’s  mood,  and  so  on  are 
interesting  dimensions  along  which  clustering  results  might  break.  In  this  section, 
we  focus  on  techniques  appropriate  for  such  non-topical  clustering,  with  a  particular 
emphasis  on  genre.  Although  the  field  of  non-topical  (supervised)  classification  is 
well  explored  in  the  literature  (a  lot  of  work  was  done  on  classification  by  genre 

18 Note that this is not the only advantage of Comrafs over multivariate IB models. The Comraf
framework is substantially simpler and more intuitive (e.g. the multivariate IB introduces in-space
and out-space concepts which are unnecessary in Comrafs). In contrast, Comraf inference algorithms
are more complex and effective than those proposed for the multivariate IB (see Section 4.6.5 for a
discussion).
[59, 60, 42, 71, 92], by text authorship [77, 3], by writer's gender [63], tone [107, 85]
and mood [82], as well as by familiarity with the topic of the discussion [64]), we
believe that the problem of genre clustering had not been comprehensively studied
before we approached it in [9, 15].

To apply the Comraf framework to the task of clustering by genre, we first have to
decide on the modalities that would best match the task. Documents are labeled with
genres on the basis of external criteria such as intended audience, purpose and activity
type [70]. The notion of genre can be described in terms of the syntax/semantics
duality of text: documents of different genres use different syntactic constructions
and/or different vocabulary. It is not obvious whether syntactic or semantic features
play a major role in clustering documents by genre. We propose to take advantage
of both. Besides the document modality, we consider two other modalities: words
(that correspond to documents' vocabularies) and Part-Of-Speech (POS) n-grams
(that correspond to the syntactic structure of text). POS n-grams are extracted from
sentences in an incremental manner: the first n-gram starts with the POS tag of the
first word in the sentence, the second one starts with the tag of the second word, etc.
For example, out of the sentence

<PNP>It <VBZ>'s <AT0>a <AJ0>real <NN1>holiday <PUN>.

we extract four trigrams:

PNP_VBZ_AT0, VBZ_AT0_AJ0, AT0_AJ0_NN1, AJ0_NN1_PUN.
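
The incremental extraction above can be sketched as follows. This is a minimal Python illustration rather than the code used in our experiments; the function name `pos_ngrams` is hypothetical, and the input is assumed to be the list of POS tags of one already tagged sentence.

```python
def pos_ngrams(tags, n):
    """Return the POS n-grams of one sentence, one starting at each position.

    A sentence with fewer than n tags yields an empty list.
    """
    return ["_".join(tags[i:i + n]) for i in range(len(tags) - n + 1)]


# Tags of the example sentence "It's a real holiday."
sentence_tags = ["PNP", "VBZ", "AT0", "AJ0", "NN1", "PUN"]
trigrams = pos_ngrams(sentence_tags, 3)
```

For the six tags of the example sentence, the call returns the four trigrams listed above; with n = 2 it would return five bigrams.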

Given a document collection, let $D$ be a random variable over its documents, $W$ be
a random variable over its words, and $S$ be a random variable over the POS n-grams
of its words. We apply a multi-modal Comraf model (Section 3) for constructing a
clustering $d^{c*}$ of documents, a clustering $w^{c*}$ of words and/or a clustering
$s^{c*}$ of POS n-grams, by maximizing the objective derived from Equation (4.3). In
this section, we consider four Comraf models for clustering by genre:
Figure 4.8. Comraf graphs for: (a) 1-way document clustering with POS unigrams
as an observed r.v. (shaded node); (b) 2-way clustering of documents and POS
bigrams (same as for POS 3-grams or 4-grams); (c) 2-way clustering with BOW; (d)
3-way clustering with POS bigrams and BOW.


1.  POS unigrams: Since the number of POS tags in any tagging system is
relatively small, it makes no sense to cluster POS unigrams. Therefore, we
apply a 1-way model for clustering documents using the Comraf graph shown
in Figure 4.8(a). The objective function from Equation (4.3) in this simple case
has the form of $I(\tilde{D}; S)$.

2.  POS n-grams, where n > 1: The number of unique POS n-grams of order
higher than 1 is exponential in n, so clustering them is necessary. We
perform a 2-way clustering with the Comraf graph from Figure 4.8(b) and the
objective $I(\tilde{D}; \tilde{S})$.

3.  Bag-Of-Words: The number of unique words in our dataset is comparable
with the number of POS trigrams, so in analogy to the previous model, we
perform a 2-way clustering with the Comraf graph of Figure 4.8(c) and the
objective $I(\tilde{D}; \tilde{W})$.

4.  BOW+POS hybrid: We combine contextual information of BOW and stylistic
information of POS n-grams into a 3-way clustering model, where we
simultaneously cluster documents, words and bigrams of POS tags. Over the Comraf
graph of Figure 4.8(d), we maximize the sum $I(\tilde{D}; \tilde{S}) + I(\tilde{D}; \tilde{W})$.
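
To make the four objectives concrete, the sketch below evaluates the mutual information between cluster variables from raw co-occurrence counts. It is a minimal Python illustration under stated assumptions: the joint distribution is estimated by normalizing counts, and the names `clustered_mutual_information`, `hybrid_objective`, `dc`, `sc` and `wc` are hypothetical. It evaluates the objective for given clusterings only and does not perform the MDC optimization.

```python
from collections import defaultdict
from math import log2


def clustered_mutual_information(counts, row_cluster, col_cluster=None):
    """Mutual information (in bits) between row clusters and column clusters.

    counts maps (row_item, col_item) pairs to co-occurrence counts.
    row_cluster maps each row item to its cluster. col_cluster does the same
    for column items; None leaves the columns unclustered, as in model 1.
    """
    joint = defaultdict(float)
    for (row, col), n in counts.items():
        c = col if col_cluster is None else col_cluster[col]
        joint[(row_cluster[row], c)] += n
    total = sum(joint.values())
    p_row, p_col = defaultdict(float), defaultdict(float)
    for (r, c), n in joint.items():
        p_row[r] += n / total
        p_col[c] += n / total
    return sum(
        (n / total) * log2((n / total) / (p_row[r] * p_col[c]))
        for (r, c), n in joint.items()
        if n > 0
    )


def hybrid_objective(doc_pos, doc_word, dc, sc, wc):
    """Objective of model 4: I(D~; S~) + I(D~; W~)."""
    return (clustered_mutual_information(doc_pos, dc, sc)
            + clustered_mutual_information(doc_word, dc, wc))


toy_counts = {("d1", "a"): 1, ("d2", "b"): 1}
separated = clustered_mutual_information(toy_counts, {"d1": 0, "d2": 1})
merged = clustered_mutual_information(toy_counts, {"d1": 0, "d2": 0})
```

In the toy example, keeping the two documents in separate clusters preserves one bit of information about the features, while merging them into one cluster preserves none; the optimization favors clusterings of the former kind.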
Doc representation | k-means | LDA         | Comraf
Bag-Of-Words       | 9.1%    | 55.4 ± 0.1% | 55.7 ± 0.2%
POS bigrams        | 23.2%   | 44.7 ± 0.2% | 51.0 ± 0.2%
BOW + POS bigr     | n/a     | n/a         | 58.5 ± 0.6%

Table 4.8. Clustering by genre. Micro-averaged clustering accuracy on the BNC
corpus, averaged over four independent runs. Standard error of the mean is shown
after the ± sign. Comraf results with other POS tuples, besides bigrams, are in
Figure 4.9 (left). The BOW+POS hybrid setup is only applicable in Comrafs.


4.9.1  Dataset 

We evaluate our models on the British National Corpus (BNC) [24]. We employ
David Lee's ontology of BNC genres [70] with 46 genres covering most aspects of
modern literature such as fiction prose, biography, technical report, news script and
others. To perform fair evaluation using micro-averaged clustering accuracy
(Section 4.6.1), we choose the 21 largest categories, from each of which we choose 32
documents uniformly at random, so our resulting dataset consists of 672 documents.
The BNC texts are formatted using the SGML markup language. We remove all markup,
lowercase the text, and delete stopwords and low-frequency words. All words in the
BNC corpus are semi-manually tagged using 91 POS tags, four of which refer to
punctuation. The resulting dataset has 63,634 unique words and 5864 POS bigrams.
Since the overall number of unique POS trigrams and fourgrams is prohibitively large,
we apply more aggressive term filtering: we consider trigrams that appear in at least
10 documents (44,499 trigrams overall) and fourgrams that appear in between 10 and 99
documents (114,476 fourgrams).
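
The term filtering described above amounts to thresholding on document frequency. The following Python sketch illustrates it under the assumption that each document is given as a list of terms (words or POS n-grams); the function name and parameters are hypothetical and do not come from the software used in our experiments.

```python
from collections import Counter


def filter_by_document_frequency(docs, min_df, max_df=None):
    """Keep only terms whose document frequency lies in [min_df, max_df].

    docs is a list of documents, each a list of terms. A term is counted
    once per document, no matter how often it occurs in that document.
    """
    df = Counter(term for doc in docs for term in set(doc))
    keep = {
        term for term, k in df.items()
        if k >= min_df and (max_df is None or k <= max_df)
    }
    return [[term for term in doc if term in keep] for doc in docs]


toy_docs = [["a", "b"], ["a", "c"], ["a", "b", "b"]]
at_least_two = filter_by_document_frequency(toy_docs, min_df=2)
exactly_two = filter_by_document_frequency(toy_docs, min_df=2, max_df=2)
```

Under this sketch, the trigram setting corresponds to `min_df=10` and the fourgram setting to `min_df=10, max_df=99`.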

4.9.2  Comparative  results 

We compare the results of Comraf models (with the MDC optimization algorithm)
with the results of k-means (Weka implementation), as well as of Latent Dirichlet
Allocation (LDA). As in Section 4.6, we use Xuerui Wang's LDA implementation [78]
that performs Gibbs sampling with 10000 sampling iterations. Table 4.8 summarizes
our results.

As we can see from Table 4.8, MDC achieves more than 50% accuracy with both
BOW and POS bigram document representations. Note that a random assignment
of documents into clusters would lead to about 5% accuracy on our dataset (which
has 21 equally sized genres), so above 50% accuracy is an impressive result for a
purely unsupervised method on a large, well-balanced dataset. The LDA+BOW system
obtains practically the same accuracy as MDC+BOW does (55.4% vs. 55.7%). However,
LDA demonstrates strictly inferior performance (lower than MDC by 6% absolute) on
the POS bigram representation. We can also see that MDC+BOW significantly
outperforms MDC+POS (by more than 4% absolute). This observation may imply that
contextual features (such as words) play a more important role for genre
classification than stylistic features (such as POS n-grams).

To give some insight into the differences in MDC performance on BOW and POS
bigrams, we present Table 4.9 that shows the distribution of documents of each genre
over the generated clusters. For each genre we show a list of sizes (in number of
documents) of this genre's representation in various clusters. We sort this list by
the size of the representation from the largest to the smallest. An asterisk after the
number of documents means that this genre is dominant in the corresponding cluster.
A heavy-tailed distribution (such as the one of W_non_ac_soc_science) implies that
the genre is spread over many clusters, which is clearly a failure. In contrast, a peaked
distribution (e.g., of W_non_ac_tech_engin) with an asterisk on its largest component
means that the genre was successfully identified.
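
The per-genre lists described above can be derived from a clustering as in the following Python sketch. It is an illustration only: the function name is hypothetical, and ties for dominance within a cluster are resolved by marking every tied genre, which is an assumption rather than a documented convention.

```python
from collections import Counter, defaultdict


def genre_profiles(cluster_ids, genres):
    """Map each genre to its sorted list of per-cluster sizes, e.g. '23* 4 2'.

    An asterisk marks a size that is the largest genre count in its cluster
    (assumption: all genres tied for the largest count are marked).
    """
    per_cluster = defaultdict(Counter)  # cluster -> genre counts
    for cluster, genre in zip(cluster_ids, genres):
        per_cluster[cluster][genre] += 1
    entries = defaultdict(list)  # genre -> [(size, is_dominant), ...]
    for counts in per_cluster.values():
        top = max(counts.values())
        for genre, size in counts.items():
            entries[genre].append((size, size == top))
    return {
        genre: " ".join(
            f"{size}{'*' if dominant else ''}"
            for size, dominant in sorted(parts, key=lambda p: -p[0])
        )
        for genre, parts in entries.items()
    }


toy_profiles = genre_profiles([0, 0, 0, 1, 1, 2], ["x", "x", "y", "y", "y", "x"])
```

In the toy example, genre "x" dominates clusters 0 and 2, while genre "y" dominates cluster 1 and leaves one document in cluster 0.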

As we can see from the table, MDC performs similarly on BOW and POS bigrams.
However, some significant differences can be found. For example, genres W_biography,
W_commerce and W_institut_doc are successfully identified by MDC+BOW but not
by MDC+POS, while MDC+POS better recognizes W_newsp_brdsht_nat_social and
Genre                       | MDC with POS bigrams | MDC with BOW | LDA with BOW | MDC with BOW and POS bigrams
W_ac_humanities_arts        | 9* 6* 6 4 2 2 1 1 1 | 9* 6* 5 5 3 2 1 1 | 7 6 5 5 4 4 1 | 9 6* 5 5 4 1 1 1
W_ac_nat_science            | 23* 4 2 2 1 | 24* 6 1 1 | 12* 11* 9 | 27* 4 1
W_ac_polit_law_edu          | 14* 8 5 2 1 1 1 | 20* 5 2 2 1 1 1 | 19* 7 4 2 | 17 6 4 2 1 1 1
W_ac_soc_science            | 11* 9* 6 5 1 | 12* 10* 7 1 1 1 | 12* 9* 8* 1 1 1 | 16* 7 6 3
W_advert                    | 14* 11 3 2 2 | 18* 3 3 2 2 1 1 1 1 | 22* 2 2 2 1 1 1 1 | 23* 2 1 1 1 1 1 1 1
W_biography                 | 15* 8 6 1 1 1 | 12 7 6 3 2 1 1 | 16* 6 4 2 2 1 1 | 16* 6 6* 2 1 1
W_commerce                  | 10* 5 5 4 2 2 1 1 1 1 | 13 10 6 1 1 1 | 16 5 4 2 2 1 1 1 | 9* 9 4 3 3 2 1 1
W_fict_prose                | 22* 7 3 | 25* 6 1 | 30* 2 | 24* 6 2
W_institut_doc              | 15* 6 5 5 1 | 18 6 4 1 1 1 1 | 17* 7 4 2 2 | 14 11* 3 1 1 1 1
W_newsp_brdsht_nat_arts     | 25* 5 1 1 | 28* 1 1 1 1 | 30* 2 | 27* 2 2 1
W_newsp_brdsht_nat_commerce | 26* 2 1 1 1 1 | 32* | 28* 2 1 1 | 31* 1
W_newsp_brdsht_nat_report   | 32* | 32* | 30* 2 | 32*
W_newsp_brdsht_nat_social   | 9 7 4 4 2 2 1 1 1 1 | 11* 6 4 3 2 2 1 1 1 1 | 10 7 6 3 2 1 1 1 1 | 14 6 3 2 2 2 1 1 1
W_news_script               | 32* | 32* | 31* 1 | 32*
W_non_ac_humanities_arts    | 11* 8 3 2 2 2 1 1 1 1 | 9* 6 5 3 3 2 2 2 | 10* 7 5 3 2 2 2 1 | 14* 5 3 3 2 1 1 1 1 1
W_non_ac_nat_science        | 14* 5* 3 2 2 2 1 1 1 1 | 18* 11 2 1 | 11* 9 7 2 2 1 | 29* 1 1 1
W_non_ac_polit_law_edu      | 11* 4 4 3 3 2 2 1 1 1 | 11 10* 5 3 2 1 | 10* 10* 3 3 2 2 1 1 | 10* 6 5 5 2 2 1 1
W_non_ac_soc_science        | 5 5 4 3 3 2 2 2 2 1 1 1 1 | 7 5 4 4 3 2 2 2 2 1 | 7 6 5 5 3 2 1 1 1 1 | 5 5 4 3 3 3 2 2 2 1 1 1
W_non_ac_tech_engin         | 32* | 32* | 32* | 32*
W_pop_lore                  | 11 6 6 5 4 | 10* 9* 4 4 2 2 1 | 12 8 6 3 2 1 | 16* 8 3 2 2 1
W_religion                  | 11* 5 4 4 2 2 1 1 1 1 | 18* 6 2 1 1 1 1 1 1 | 20* 6* 2 1 1 1 1 | 18* 6* 3 1 1 1 1 1

Table 4.9. Performance of various methods per genre. For each genre we
show a list of sizes (in number of documents) of this genre's representation in
various clusters. We sort this list by the size of the representation from the largest
to the smallest. An asterisk after the number of documents means that this genre is
dominant in the corresponding cluster.


W_pop_lore. A 3-way MDC with both BOW and POS, which would take advantage of
both approaches, may have a good chance of showing even better results.

Indeed, we obtain a strong result with the 3-way MDC: 58.5% accuracy. The
last column of Table 4.9 presents the analysis of this result by genre. For many
genres (such as W_non_ac_nat_science) we enlarge their dominant representations.
We also manage to identify four of the five genres that were in disagreement
between BOW and POS models (as discussed above). However, we no longer recognize
W_ac_polit_law_edu, which indicates that the results might potentially be improved
even more.

One could argue that the direct comparison of results obtained by the BOW and
POS bigram models is actually unfair because the number of BOW features is an
order of magnitude greater than the number of POS bigrams, so that the BOW model
naturally outperforms the POS bigram model because it just contains more
information. However, this argument is not supported by our experiments. We test
MDC with POS trigrams and fourgrams, as well as with POS unigrams, and show that
while the MDC performance with unigrams is significantly lower than with bigrams,
trigrams and fourgrams do not significantly improve the results of bigrams. In
Figure 4.9 (left) we can see that when moving from bigrams to trigrams and
fourgrams, the graph has a slightly positive slope; however, the results become
noisier (the standard error becomes higher), which diminishes the statistical
significance of the improvement. A conclusion that can be made from this experiment
is that the Bag-Of-POS-bigrams model appears to be rich enough to capture genres of
documents.

A common belief is that stopwords and other high frequency words can be good
features for discrimination of documents by genre (see, e.g. [100]). It is interesting to
see whether we can support this hypothesis with empirical evidence. To test it, we
conduct the following experiment. We put various thresholds on the low frequency
words in the BOW representation of the documents. We consider four such thresholds:
our initial setup, when we filter out words that appear in fewer than 3 documents, as
well as three new ones: 10, 20 and 50 documents. Note that the new thresholds, and
especially the most restrictive one (50), leave us with highly frequent words only: since
our dataset consists of 672 documents, filtering out words that appear in fewer than
50 documents causes removal of over 93% of unique words from the dataset. We run
MDC on the four representations. Figure 4.9 (right) shows results of this experiment.
We can see that although the graph has a negative slope, the decrease in the results is
insignificant. With 7% of words from the original dataset the MDC system obtains
only 2.5% lower accuracy than with 38% of words (where the rest appear in only one
or two documents and can be removed with high confidence). This result supports the
hypothesis that high frequency words are important for genre classification.
Figure 4.9. Clustering by genre. Micro-averaged clustering accuracy of Comraf
models as a function of: (left, "Comraf accuracy with POS ngrams") size of POS
n-gram (1-grams, 2-grams, 3-grams and 4-grams); (right, "Comraf accuracy with BOW")
threshold on low frequency words, where a point i on the X axis means that in this
experiment words that appear in fewer than i documents are removed.


4.10  Summary 

In this chapter, we have proposed the objective function for Comraf clustering and
presented two inference methods in Comrafs: a global optimization method (MDC)
and a local optimization method (CWO). Comraf models have been successfully
applied to document clustering. We have tested Comrafs on a variety of clustering
tasks:


•  On email clustering (see Section 4.6), a bi-modal Comraf is compared with three
state-of-the-art clustering methods. It outperforms a (uni-modal) sequential IB
method because it benefits from the multi-modal nature of the data. The
advantage of the bi-modal Comraf over the bi-modal (flat) ITCC method suggests
that the power of our inference algorithm stems from a better exploitation of
the clustering hierarchy. The Comraf model demonstrates superior performance
in comparison to LDA (a generative graphical model) because Comrafs
provide a more flexible modeling environment (see Section 3.4). Also, we provide
evidence that extending a bi-modal Comraf to 3-modal and 4-modal setups can
further improve document clustering results.
•  In Section 4.7 we apply a Comraf model to the real-world task of Web
appearance disambiguation (WAD) of people's names. We show that it slightly
outperforms a strong baseline method that employs link structure analysis of
Web pages. In [13] we show that the best results are achieved when using a
hybrid of the Comraf clustering and link structure analysis. In Section 6.6.1
we will show a better method for WAD that is based on one-class clustering of
documents.

•  In Section 4.8 we address the question of whether Comrafs have more modeling
power than the previously proposed multivariate IB framework [97]. We provide
an example of strict superiority of Comraf models.

•  Finally, in Section 4.9 we apply Comrafs to a non-topical document clustering
task. We focus on clustering by genre, where a lexical modality (e.g. words) is
used in conjunction with a stylistic modality (POS n-grams). Similar Comraf
models can be applied to document clustering according to other non-topical
criteria, such as readability. In Section 5.3 we will extend the non-topical
clustering model to a semi-supervised case and test it on clustering by author's
sentiment.

Being a valid graphical model, a Comraf takes advantage of modeling abilities of
existing graphical models. For example, we can introduce an observed state through
which some prior knowledge can be represented. The next chapter describes the
resulting model.
CHAPTER 5


COMRAFS  FOR  SEMI-SUPERVISED  LEARNING 


The Comraf model is a convenient framework for performing semi-supervised
clustering [16, 17] (see Section 5.1), transfer learning [17] (see Section 5.2), and
interactive clustering [15] (see Section 5.3). Prior to presenting details of particular
Comrafs, let us define the concepts of hidden and observed states in the Comraf
model. A combinatorial r.v. is hidden if it can take any value from its event space.
A combinatorial r.v. is observed if its value is preset and fixed.

5.1  Semi-supervised  clustering  with  Comrafs 

Semi-supervised clustering is a clustering task that takes advantage of labeled
examples. Usually, semi-supervised clustering is performed when the number of
available labeled examples is not sufficient to construct a good classifier (e.g., the
constructed classifier would overfit), or when the labeled data is noisy or skewed
to a few classes. Assuming that most of the labeled data is accurate, our goal is to
incorporate it into the (unsupervised) Comraf model.

In this thesis, we consider only a uni-labeled case where each labeled data point
$x_i$ belongs to one ground truth category $t_j$, $j = 1, \ldots, k$. We propose an
intrinsic Comraf approach for incorporating labeled data into clustering (by
introducing observed nodes to a Comraf graph), and compare it with existing seeding
[7] and constrained optimization [110] schema.

Intrinsic approach. Comrafs offer an elegant method for incorporating labeled
data, which does not require any significant changes in the clustering model proposed
in Chapter 4. First, note that labels define a natural partitioning of the labeled data:
for each label $t_j$ let $x_{0j}$ be a subset of $X$ labeled with $t_j$, i.e.
$x_{0j} = \{x_i \mid t_i = t_j\}$. We now define a r.v. $X_0$ over the partitioning
$x_0^c = \{x_{0j} \mid j = 1, \ldots, k\}$, and we also define a combinatorial r.v.
$X_0^c$ over all the possible partitionings of the set $X$. Since the partitioning
$x_0^c$ is given to us, the variable $X_0^c$ is observed, with $x_0^c$ being its fixed
value. Observed combinatorial random variables appear shaded on a Comraf graph.
The objective function from Equation (4.4) and the MPE inference procedure remain
unchanged (with the only difference being that there is no need for optimizing the
observed nodes): at each ICM iteration the current node is optimized with respect to
the fixed values of its neighbors, whereas the values of the observed nodes are fixed
by definition.
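
The two ingredients of the intrinsic approach can be sketched in Python as follows. This is a schematic illustration, not the implementation used in our experiments: `label_partition` builds the fixed partitioning of the labeled points, and `icm_sweep` shows that an ICM pass simply skips observed nodes. The node names and the `optimize_node` callback are hypothetical placeholders for the actual per-node optimization.

```python
from collections import defaultdict


def label_partition(labels):
    """Group labeled points by label: one subset {x_i | t_i = t_j} per label t_j."""
    groups = defaultdict(set)
    for point, label in labels.items():
        groups[label].add(point)
    return dict(groups)


def icm_sweep(nodes, observed, optimize_node):
    """One ICM pass: optimize every hidden node, leave observed nodes fixed."""
    for name in nodes:
        if name in observed:
            continue  # the value of an observed node is preset and never changes
        optimize_node(name)


partition = label_partition({"m1": "work", "m2": "family", "m3": "work"})

visited = []
icm_sweep(["Dc", "Wc", "D0c"], observed={"D0c"}, optimize_node=visited.append)
```

In the toy run, the partition has one subset per label, and only the two hidden nodes are visited by the sweep.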

Constrained  optimization.  Wagstaff  and  Cardie  [110]  perform  semi-supervised 
clustering  with  two  types  of  boolean  constraints.  The  must-link  constraint  ml  equals 
1  if  two  equally  labeled  data  points  are  assigned  into  different  clusters;  the  cannot-link 
constraint  cl  equals  1  if  two  differently  labeled  data  points  are  assigned  into  the  same 
cluster. A clustering objective function incorporates the constraints, e.g. in Comrafs
(Equation (4.4)) for each combinatorial r.v. $X_i^c$ it is:

$$x_i^{c*} = \arg\max_{x_i^c} \Big[ \sum_{i':\,(X_i^c,\,X_{i'}^c) \in E} I(\tilde{X}_i; \tilde{X}_{i'}) \;-\; \sum_{l} w_l\, ml_l \;-\; \sum_{l'} w_{l'}\, cl_{l'} \Big],$$

where the sums over $l$ and $l'$ run over the must-link and cannot-link constraints,
and the weights $w_l$, $w_{l'}$ are set at $+\infty$, which means that all constraints
must be satisfied. Note that in the general case we are free to choose any non-negative
weights. In order to fairly compare two semi-supervised methods, we must use the same
underlying clustering algorithm for both of them. We use the MDC algorithm (see
Section 4.3) in both cases.
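
The constraint terms can be illustrated with the following Python sketch, which counts violated must-link and cannot-link constraints over all pairs of labeled points and subtracts a weighted penalty from a given objective value. The function names, the single shared weight and the toy data are assumptions made for illustration; the sketch is not taken from [110] or from our implementation.

```python
from itertools import combinations


def count_violations(assignment, labels):
    """Return (must_link, cannot_link) violation counts over labeled pairs.

    assignment maps every point to its cluster; labels maps labeled points
    to their ground-truth labels.
    """
    ml = cl = 0
    for a, b in combinations(sorted(labels), 2):
        same_label = labels[a] == labels[b]
        same_cluster = assignment[a] == assignment[b]
        if same_label and not same_cluster:
            ml += 1  # equally labeled points were split
        elif not same_label and same_cluster:
            cl += 1  # differently labeled points were merged
    return ml, cl


def penalized(objective_value, assignment, labels, weight=float("inf")):
    """Subtract weight * (number of violated constraints) from the objective."""
    ml, cl = count_violations(assignment, labels)
    violations = ml + cl
    if violations == 0:
        return objective_value  # avoids inf * 0, which is undefined
    return objective_value - weight * violations


toy_labels = {"a": 1, "b": 1, "c": 2}
bad = {"a": 0, "b": 1, "c": 1, "d": 0}
good = {"a": 0, "b": 0, "c": 1, "d": 1}
```

With infinite weights, any assignment that violates a constraint (such as `bad`) receives an objective of minus infinity, so only constraint-satisfying assignments (such as `good`) can be selected.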

Seeding  [7]  is  a  method  of  constructing  the  initial  clustering  of  both  labeled 
and  unlabeled  data  points,  for  which  the  must-link  and  cannot-link  constraints  are 




Figure  5.1.  Comraf  graphs  for:  (left)  semi-supervised  clustering;  (right)  clustering 
with  transfer  learning. 


satisfied. This method is applied to Comraf clustering by adapting the initialization
step of the MDC algorithm (see Algorithm 2): for each node $X_i^c$ we select an initial
point in lattice $L_i$ that satisfies the seeding constraints. Note that, in contrast to the
constrained optimization scheme described above, in the seeding scheme the
clustering objective function remains unchanged, so the seeding constraints may no
longer be satisfied during the course of the MDC algorithm.

5.1.1  Experimentation 

Figure 5.1 (left) shows a Comraf graph for the intrinsic scheme of semi-supervised
clustering. Together with a combinatorial r.v. $D^c$ over document clusterings and
a combinatorial r.v. $W^c$ over word clusterings, we introduce an observed node $D_0^c$,
whose value $d_0^c$ is a given partitioning of labeled documents. With a random variable
$D_0$ defined over the clusters in $d_0^c$, our objective derived from Equation (4.3) is:

$$(d^{c*}, w^{c*}) = \arg\max_{d^c,\, w^c} \; I(\tilde{D}; \tilde{W}) + I(\tilde{D}; D_0) + I(\tilde{W}; D_0).$$

As mentioned above, the ICM optimization procedure remains unchanged and iterates
over nodes $D^c$ and $W^c$ only (the observed node $D_0^c$ is not optimized).

It is interesting to note that the seeding approach to semi-supervised clustering
appears to be useless when applied to Comrafs. Despite the sophisticated
initialization, the optimization procedure leads to the same local maxima of the
objective as in the case of trivial initialization. When applied to document clustering,
the MDC algorithm with seeding and without seeding demonstrates the same
performance. Below we compare our intrinsic Comraf scheme with the constrained
optimization only, which is naturally robust to the choice of a particular optimization
method.

On  the  CALO  and  Enron  datasets  described  in  Section  4.6.2,  we  conduct  the 
following  experiment:  for  each  dataset,  we  uniformly  at  random  select  10%,  20%, 
or  30%  of  the  data  and  refer  to  it  as  labeled  examples  while  the  rest  of  the  data 
is  considered  unlabeled.  We  apply  both  intrinsic  and  constrained  methods  on  these 
three  setups  and  plot  the  micro-averaged  accuracy  (calculated  on  unlabeled  data  only) 
vs.  the  percentage  of  labeled  data  used.  The  results  (in  terms  of  clustering  accuracy 
as  defined  in  Section  4.6.1)  are  shown  in  Figure  5.2.  As  we  can  see  from  the  figure, 
both  methods  unsurprisingly  improve  the  unsupervised  results,  while  the  intrinsic 
Comraf  method  usually  outperforms  the  constrained  method. 

On  the  20NG  dataset,  we  select  10%  of  data  to  be  labeled.  The  constrained 
method  obtains  74.8  ±  0.6%  accuracy,  while  the  intrinsic  method  obtains  78.9  ± 
0.8%  accuracy  (over  5%  and  9%  absolute  improvement  to  the  unsupervised  result, 
respectively). 

The intrinsic scheme is resistant to noise. To show this, we conduct the following
experiment: on the CALO datasets with the 20%/80% labeled/unlabeled split, we
arbitrarily corrupt labels of 10%, 20% and 30% of the labeled data. Figure 5.2(f) shows
that clustering accuracy remains almost unchanged for all three datasets.
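
The label corruption step can be sketched as below. The text does not specify how labels are corrupted, so this Python illustration makes an explicit assumption: a randomly chosen fraction of the labeled points is reassigned, uniformly at random, to a different label. The function name, the `seed` parameter and the toy data are hypothetical.

```python
import random


def corrupt_labels(labels, fraction, label_set, seed=0):
    """Return a copy of labels in which a fraction of points gets a wrong label.

    Assumption: each selected point is reassigned uniformly at random to one
    of the other labels in label_set (which must contain at least two labels).
    """
    rng = random.Random(seed)
    corrupted = dict(labels)
    victims = rng.sample(sorted(labels), round(fraction * len(labels)))
    for item in victims:
        alternatives = [t for t in label_set if t != labels[item]]
        corrupted[item] = rng.choice(alternatives)
    return corrupted


clean = {f"doc{i}": ("a" if i % 2 else "b") for i in range(10)}
noisy = corrupt_labels(clean, 0.2, ["a", "b", "c"])
changed = sum(clean[d] != noisy[d] for d in clean)
```

With ten labeled documents and a 20% corruption rate, exactly two labels differ between `clean` and `noisy`.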

5.2  Transfer  learning  with  Comrafs 

Transfer learning is the problem of applying the knowledge learned in one task
to effectively solve another learning task. In this section, we represent the acquired
knowledge as a partitioning $y_0^c$ pre-built for data $\mathcal{Y}$ that can be used for
constructing a partitioning $x^c$ of data $\mathcal{X}$. We note that the intrinsic scheme
for semi-supervised
Figure 5.2. Plots (a)-(e): comparing accuracies of the semi-supervised Comraf and
the constrained optimization method on five email datasets. Plot (f): the
semi-supervised Comraf's resistance to noise in labeled data.


clustering presented in Section 5.1 above allows us to directly use labeled data not only
from $\mathcal{X}$ but also from another collection $\mathcal{Y}$. Thus, in analogy to the
semi-supervised case, we introduce an observed combinatorial r.v. $Y_0^c$ with a fixed
value $y_0^c$. During the inference process, we construct $x^{c*}$ that maximizes
agreement (in terms of mutual information) with the labeled data $y_0^c$, while applying
the same objective function as in Equation (4.3) and the same ICM optimization
procedure.

We  set  up  a  transfer  learning  experiment  as  follows.  We  notice  that  in  two  of 
the  CALO  datasets  (acheyer  and  mgervasio)  similar  topics  are  discussed.  Our 
hypothesis  is  that  known  categories  of  one  dataset  can  improve  the  clustering  results 
on  another  dataset.  To  test  this  hypothesis,  we  first  consider  one  dataset  to  be 
labeled,  while  the  other  one  is  unlabeled,  and  then  vice  versa.  However,  since  the 
two  datasets  do  not  consist  of  the  same  documents,  we  decide  to  use  word  clusters  of 
the  labeled  dataset.  We  first  cluster  words  distributed  over  categories  of  the  labeled 
dataset, as described in [11]. Then we introduce the constructed word clustering
as an observed node $W_0^c$ into the Comraf graph (see Figure 5.1 right) and perform
the ICM inference. Using this scheme we improve the micro-averaged clustering
accuracy on mgervasio by 3% absolute over unsupervised clustering, but we do
not see any change in accuracy on the acheyer dataset. This preliminary result
demonstrates  the  usability  of  Comrafs  for  transfer  learning;  other  types  of  Comraf 
models  for  transfer  learning  are  emerging. 
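
To make the notion of agreement "in terms of mutual information" concrete, the following is a minimal sketch of how the mutual information between a document clustering and a (possibly observed) word clustering could be estimated. It assumes that the joint distribution is estimated from document-word co-occurrence counts; the function name and the input layout are illustrative assumptions and are not taken from the thesis.

```python
import math
from collections import defaultdict


def cluster_mutual_information(counts, doc_cluster, word_cluster):
    """Estimate I(D^c; W^c) in bits from co-occurrence counts.

    counts:       dict mapping (doc, word) -> co-occurrence count (assumed input layout)
    doc_cluster:  dict mapping doc -> document cluster id
    word_cluster: dict mapping word -> word cluster id
    """
    # Aggregate raw counts into a joint table over (doc cluster, word cluster).
    joint = defaultdict(float)
    total = 0.0
    for (doc, word), c in counts.items():
        joint[(doc_cluster[doc], word_cluster[word])] += c
        total += c

    # Marginal distributions of the two cluster variables.
    p_d = defaultdict(float)
    p_w = defaultdict(float)
    for (dc, wc), c in joint.items():
        p_d[dc] += c / total
        p_w[wc] += c / total

    # Sum p(d, w) * log2( p(d, w) / (p(d) p(w)) ) over non-empty cells.
    mi = 0.0
    for (dc, wc), c in joint.items():
        p = c / total
        if p > 0:
            mi += p * math.log2(p / (p_d[dc] * p_w[wc]))
    return mi
```

Under this sketch, a document clustering that lines up with the observed word clusters scores higher than one that mixes them, which is the behavior the inference procedure is meant to reward.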

5.3  Interactive  clustering  with  Comrafs 

In  interactive  clustering  of  text  collections,  the  user  is  actively  involved  in  the 
process  of  clustering  documents,  their  features,  or  both  (see,  e.g.,  [53]).  Being  thus 
provided  with  some  level  of  supervision,  the  interactive  clustering  scheme  can  be 
viewed as an instance of semi-supervised learning. In Sections 5.1 and 5.2 above,
we have shown how to incorporate prior knowledge into the Comraf graph G, while
using the same objective and inference algorithm as in unsupervised clustering. Here,
we incorporate prior knowledge into our inference algorithm, preserving the Comraf
graph  and  the  objective  (4.3)  of  the  unsupervised  case. 

In [15], we proposed interactive clustering as a unified framework for clustering
document collections according to nearly any criterion of the user’s choice: documents’
style, readability, credibility; authors’ age, mood, sentiment, familiarity with
the topic etc. (for the beginning of the discussion and an example of clustering by
genre, see Section 4.9). The user is first asked to choose modalities (or types of
features) suitable for clustering by the desired criterion. In clustering by genre, for
example, documents may be represented over sequences of Part-Of-Speech (POS)
tags, punctuation marks, stopwords, as well as over general words as captured in the
standard BOW representation. The user is next asked to provide a few examples of
features (seed features) of the chosen types, if such examples are intuitive and can
be obtained without much effort; e.g., when clustering by author’s mood, words like
‘angry’, ‘happy’, ‘upset’ might be easily suggested.

The clustering system then represents documents based on the user’s choice and
applies  a  Comraf  clustering  method.  When  seed  features  are  provided,  the  system 
iteratively  clusters  documents  represented  over  the  chosen  features  and  then  enriches 
feature  sets  with  other  useful  features.  The  user  can  choose  to  intervene  (or  not)  after 
each  iteration,  in  order  to  fix  possible  mistakes  made  by  the  system  on  the  feature 
level  (no  document  labeling  is  required). 

In  this  section,  we  illustrate  the  effectiveness  of  our  approach  on  clustering  by 
author’s sentiment [107]. In clustering by sentiment, data categories correspond to
different levels of the authors’ attitude to the discussed topic (e.g. liked/disliked,
satisfied/unsatisfied etc.). The categories can be finer grained (strongly liked / somewhat
liked etc.), as long as it is possible to distinguish between two adjacent categories.
We perform interactive clustering within a bi-modal Comraf framework, where
documents and words are clustered simultaneously. The user is involved in the process
of  clustering  words  (it  is  easier  for  the  user  to  be  involved  in  clustering  words  than  in 
clustering  documents  [90]). 

5.3.1  Related  work 

There has been work on interactive topical clustering where the user corrects
clustering errors on a document basis [8], but that effort is more time consuming than
feedback on features [90]. Other recent work has had the user select important
keywords for (supervised) categorization, thereby leveraging the user’s prior knowledge
[31, 90]; these approaches are closer to our framework. Raghavan et al. [90]
further support this direction in the finding that users can identify useful features
with reasonable accuracy as compared to an oracle. Liu et al. [72] experiment with
labeling words instead of documents for text classification, providing the user with a
list of candidate words from which to select potentially good seed words, based on
which  a  training  set  is  constructed  from  a  set  of  unlabeled  documents.  A  classifier 
is  then  constructed  given  this  training  set.  Liu  et  al.’s  document  representation  is 
the  standard  BOW,  which  has  strong  topical  flavor,  and  therefore  cannot  be  used  for 
clustering  by  arbitrary  criteria  (for  example,  our  preliminary  experiments  show  that 
BOW  is  not  appropriate  for  clustering  by  author’s  mood).  In  addition,  Liu  et  al.’s 
method  involves  the  user  only  at  the  initial  step  (selecting  seed  words),  limiting  the 
user’s  control  of  the  classification  process. 

Although the supervised task of classification by sentiment has been widely
addressed in the literature (see, e.g. [84] and references therein), clustering by sentiment
has been very sparsely covered. Turney [107] performs a binary clustering of product
reviews by authors’ sentiment, where only two clusters of documents are constructed:
positive reviews and negative reviews. We are not aware of previous work on
clustering by sentiment that goes beyond the binary approach. In this section, however, we
cluster  reviews  into  four  groups,  corresponding  to  the  categories  of  strongly  positive, 
somewhat  positive,  somewhat  negative  and  strongly  negative  reviews. 

5.3.2  Interactive  clustering  scenario 

Here  we  provide  a  step-by-step  recipe  for  clustering  documents  by  a  particular 
criterion  that  the  user  has  in  mind: 

1.  Specify  the  number  of  clusters:  Learning  the  natural  number  of  clusters 
still remains an open problem. We do not attempt to solve it in this thesis;
instead, the user is asked to specify the desired number of clusters.

2.  Specify  feature  types:  A  list  of  various  feature  types  is  provided  to  the  user. 
Examples of such types are: bag of words or word n-grams, POS tags or POS tag
n-grams, punctuation, parse subtrees and other types of syntactic and semantic
patterns that can be extracted from text. Such a list can hypothetically include
a large variety of feature types that would respond to everyone’s needs. From
this  list  the  user  is  asked  to  choose  one  or  more  types  that  best  serve  the 
particular  clustering  criterion. 

3. Give examples of features: For each feature type chosen, the user should
attempt to construct (small) sets of seed features that correspond to each category
of documents. Sometimes this task is easy: e.g., if the clustering criterion is
authors’ sentiments, then words such as ‘excellent’, ‘brilliant’ etc. would
correspond to the category of positive documents, while ‘terrible’, ‘awful’ etc. would
correspond  to  the  negative  category.  However,  when  such  sets  cannot  be  easily 
constructed  (e.g.  it  is  non-trivial  to  come  up  with  good  feature  examples  for 
clustering  by  genre — see  Section  4.9),  the  user  can  skip  this  step. 

4.  Default  clustering:  If  m  feature  types  are  chosen,  but  no  seed  features  are 
provided  by  the  user,  the  standard  (unsupervised)  clustering  scheme  is  applied 
(see  Chapter  4). 

5. Interactive Clustering: For the cases when the user has provided seed
features for some of the feature types, we propose a new model for Comraf
clustering, which combines regular clustering of non-seeded variables with an
incremental, bootstrapping procedure for seeded variables:

(a) Represent documents as distributions over the sets of seed features.
Ignore documents with zero probability given the seed features. Cluster the
remaining  documents  using  a  Comraf  clustering  method. 

(b)  Stop  if  most  documents  have  been  clustered  (for  details,  see  Section  5.3.3 
below). 

(c)  Represent  all  features  of  the  clustered  documents  as  distributions  over  the 
document  clusters.  Ignore  features  that  have  zero  probability  given  the 
clustered documents. Cluster the remaining features using the
distributional clustering method.

(d)  Select  feature  clusters  that  contain  the  original  seed  words.  Let  the  user 
revise the selected clusters: noisy features can be deleted; misplaced
features can be relocated; new features can be added. The revised clusters of
features  are  the  new  sets  of  seed  features.  Go  to  5(a). 
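
The bootstrapping procedure of Step 5 can be sketched as a short loop. The sketch below is only an illustration of the control flow: the two clustering routines are simple hypothetical stand-ins (overlap counting and majority voting), not the Comraf or distributional clustering methods; the stopping rule follows Section 5.3.3 (stop when no new documents enter the pool); and, as a simplification, feature cluster i is assumed to correspond to category i, so a feature cluster without seeds simply keeps the old seed set. The `revise` callback stands for the user's optional correction step.

```python
from collections import Counter, defaultdict


def overlap_cluster_docs(active, seed_sets):
    # Hypothetical stand-in for Comraf document clustering:
    # assign each document to the seed set it overlaps most.
    return {d: max(range(len(seed_sets)), key=lambda k: len(f & seed_sets[k]))
            for d, f in active.items()}


def majority_cluster_features(full_docs, doc_clusters, k):
    # Hypothetical stand-in for distributional feature clustering:
    # put each feature into the document cluster where it occurs most often.
    votes = defaultdict(Counter)
    for d, feats in full_docs.items():
        for f in feats:
            votes[f][doc_clusters[d]] += 1
    clusters = [set() for _ in range(k)]
    for f, c in votes.items():
        clusters[c.most_common(1)[0][0]].add(f)
    return clusters


def interactive_clustering(docs, seed_sets, cluster_docs, cluster_features,
                           revise=lambda clusters: clusters, max_iters=10):
    """docs: dict doc_id -> set of features; seed_sets: one set of seeds per category."""
    covered = -1
    doc_clusters = {}
    for _ in range(max_iters):
        seeds = set().union(*seed_sets)
        # 5(a): represent documents over seed features; ignore documents with none.
        active = {d: feats & seeds for d, feats in docs.items() if feats & seeds}
        doc_clusters = cluster_docs(active, seed_sets)
        # 5(b): stop when no new documents were added to the pool.
        if len(active) == covered:
            break
        covered = len(active)
        # 5(c): cluster all features of the clustered documents.
        full = {d: docs[d] for d in active}
        feature_clusters = cluster_features(full, doc_clusters, len(seed_sets))
        # 5(d): keep feature clusters that contain seeds, then let the user revise them.
        selected = [fc if fc & seeds else old
                    for fc, old in zip(feature_clusters, seed_sets)]
        seed_sets = revise(selected)
    return doc_clusters, seed_sets
```

Documents that never acquire a seed feature are simply absent from the returned assignment, which mirrors the decision not to force such documents into any cluster.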

5.3.3  Clustering  by  sentiment 

Following the procedure described in Section 5.3.2 above, after choosing the
number of clusters and particular feature types, the user is asked to select a few seed
features for each category. For clustering by sentiment, as well as for the somewhat
similar tasks of clustering by authors’ mood or by familiarity with the topic, relevant
feature types may be words or word n-grams (i.e. semantic features). However, for
other quite close tasks, e.g. clustering by authors’ age, not only semantics but also
syntax can matter: children, for instance, use certain words more often than adults
do; children also tend to use primitive (and sometimes erroneous) syntactic
constructions (“me going bye-bye” etc.). In this section, for simplicity, we experiment with
word  features  only. 

The  task  of  selecting  “sentimental”  seed  words  has  two  issues.  First,  it  is  easier  to 
come  up  with  words  that  correspond  to  extreme  sentimental  categories  (‘spectacular’, 
‘horrible’),  but  it  is  difficult  to  choose  seed  words  for  intermediate,  mild  categories. 
Nevertheless, as we will see in Section 5.3.6, users usually succeed in accomplishing
this task. Second, in our early experiments, users consistently tended to choose
words that were out of the vocabulary of a given dataset. Inspired by Liu et al. [72],
we decided to provide the user with a word list, to narrow her search only to the
dataset vocabulary. Unlike Liu et al. [72], whose task is topical clustering, we cannot
automatically predict which words would be relevant. Instead, we employ Zipf’s law
and provide the user with a list of words from the interior of the frequency spectrum.
We  anticipate  such  a  list  to  contain  the  most  relevant  seed  words. 
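
As an illustration of selecting words from the interior of the frequency spectrum, the sketch below keeps words whose document frequency lies strictly between two cutoffs. The function name, the tokenized input format, and the use of document frequency (rather than raw term frequency) are assumptions made for this example.

```python
from collections import Counter


def mid_frequency_words(docs, low, high):
    """Return words that appear in more than `low` and fewer than `high` documents.

    docs: iterable of token lists, one list per document (assumed input format).
    """
    doc_freq = Counter()
    for tokens in docs:
        # Count each word once per document.
        doc_freq.update(set(tokens))
    return sorted(w for w, n in doc_freq.items() if low < n < high)
```

Very frequent words (mostly function words) and very rare words (mostly idiosyncratic ones) are both discarded, leaving a list that is short enough for a user to scan.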

We then perform an iterative process of clustering that allows the user’s involvement
in  between  clustering  iterations.  We  apply  a  bi-modal  Comraf  model:  we  first  cluster 
documents  that  contain  the  selected  seed  words  and  then  we  cluster  all  words  of  these 
documents.  In  the  latter  step,  our  seed  word  groups  are  enriched  with  new  words  that 
have  been  clustered  together  with  the  original  seed  words.  The  user  is  then  asked 
to  edit  the  new  seed  word  groups,  in  order  to  correct  possible  mistakes  made  by 
the system (word removal, relocation and addition are allowed). This completes a
clustering iteration, and the next iteration can be executed.

Since the seed word groups have been enlarged, we can expect that the set of
documents that contain these seed words is now larger as well, so that the clustering
process  will  cover  more  and  more  documents  from  iteration  to  iteration.  The  process 
stops  when  no  more  documents  are  added  to  the  pool.  Documents  that  have  never 
been  covered  (the  ones  that  contain  no  seed  words  from  the  largest  seed  word  groups) 
are  considered  to  be  clustered  incorrectly.  An  alternative  approach  to  guarantee  the 
algorithm’s  convergence  would  be  to  require  enlargement  of  seed  word  groups  such 
that  at  least  one  document  is  added  to  the  clustering  at  each  iteration.  The  algorithm 
would  then  stop  when  the  entire  dataset  is  covered.  We  choose  the  former  approach 
because  (a)  we  do  not  want  to  put  additional  constraints  either  on  the  user  or  on 
the  Comraf  clustering  model;  (b)  in  each  real-world  dataset  there  can  be  documents 
whose  sentimental  flavor  is  hard  to  identify — it  would  not  be  beneficial  to  force  such 
documents  into  any  of  the  sentimental  clusters. 

5.3.4  Dataset 

We evaluate our interactive clustering system on a dataset of movie reviews. Our
dataset consists of 1613 reviews written on “Harry Potter and the Goblet of Fire
(2005)” that we downloaded from IMDB.com in May 2006.¹ The data was preprocessed
exactly  as  the  BNC  corpus  (Section  4.9.1).  We  ignore  reviews  that  do  not  have  rating 
scores  assigned  by  the  user.  The  IMDB’s  scoring  system  is  from  1  (the  worst)  to 
10  (the  best).  Based  on  our  extensive  experience  with  IMDB.com,  we  translate  these 
scores  into  four  categories  as  follows:  scores  1  to  4  are  translated  into  the  category 
strongly  disliked  (292  documents),  scores  5  to  7  are  translated  into  somewhat  disliked 
(454  documents),  scores  8  and  9  into  somewhat  liked  (447  documents),  and  score  10 
is  translated  into  the  category  strongly  liked  (420  documents).  We  do  not  introduce 
a neutral category because there are very few neutral reviews on IMDB.com.
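
The translation of IMDB scores into the four categories can be written as a small lookup. The sketch below is illustrative only; the function and constant names are ours, while the score ranges and category sizes are those stated above.

```python
# Category sizes as reported in the text; they add up to the 1613 reviews in the dataset.
CATEGORY_SIZES = {
    "strongly disliked": 292,
    "somewhat disliked": 454,
    "somewhat liked": 447,
    "strongly liked": 420,
}


def rating_to_category(score):
    """Map an IMDB rating (1 = worst, 10 = best) to one of the four sentiment categories."""
    if not 1 <= score <= 10:
        raise ValueError("IMDB scores range from 1 to 10")
    if score <= 4:
        return "strongly disliked"
    if score <= 7:
        return "somewhat disliked"
    if score <= 9:
        return "somewhat liked"
    return "strongly liked"
```

Note that the ranges are deliberately asymmetric (scores 5 to 7 count as somewhat disliked), which yields four categories of comparable size.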

5.3.5  Experimental  setup 

On  the  task  of  clustering  by  sentiment,  we  compare  our  method’s  performance 
with that of k-means and LDA (Section 4.6.3), as well as with the performance of
an  SVM  classifier  trained  on  22,476  movie  reviews.  The  training  data  for  the  SVM 
consisted  of  reviews  of  46  popular  Hollywood  movies  released  in  2005,  of  the  same 
genre  as  Harry  Potter.  The  reviews  and  genre  labels  of  movies  are  obtained  from 
IMDB.com. Again, we ignore reviews without a user-assigned rating.

¹ Bo Pang [84] maintains a popular dataset of movie reviews that, unfortunately, does not fully
correspond to our task because (a) we want to differentiate the problem of clustering by sentiment
from topical clustering, and for this reason our dataset contains reviews written on one movie only,
so that the topic of all the reviews is potentially the same; (b) movie ratings in Bo Pang’s dataset
are extracted from the reviews’ text, which is an error-prone procedure, whereas in our dataset the
ratings are assigned by the reviewers using an HTML form which leaves no room for errors.

Doc repres.      k-means    LDA           Comraf        SVM
BOW              28.2       37.0 ± 0.2    40.3 ± 0.8    39.1 ± 0.3
Sentim. list     29.0       40.2 ± 0.5    43.0 ± 0.9    41.3 ± 0.6
Interactive clustering (Oracle)           47.1 ± 0.2    n/a
Simulated classification (Oracle)         46.3 ± 0.1

Table 5.1. Clustering by sentiment. Clustering accuracy of Comraf models (both
interactive and non-interactive) is compared with clustering accuracy of k-means and
LDA, as well as with classification accuracy of SVM. All results are averaged over
four independent runs. Standard error of the mean is shown after the ± sign.

To compare our Comraf clustering with other clustering methods, we again use
the micro-averaged clustering accuracy, as described in Section 4.6.1. It is not obvious,
however, how to compare Comraf clustering results with SVM classification results.
In [16], we show that the clustering accuracy can be directly compared with the
(standard) classification accuracy if a constructed clustering is well-balanced, meaning
that each category prevails exactly in one cluster. It appears that all our clusterings
obtained using the Comraf model are well-balanced.
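
The evaluation measure and the well-balanced condition can be sketched as follows. We assume here the common definition in which each cluster is labeled by its dominant category and the accuracy is the fraction of documents that carry the dominant label of their cluster; the exact definition is the one in Section 4.6.1, and the function names below are ours.

```python
from collections import Counter, defaultdict


def dominant_categories(clusters, labels):
    """For each cluster id, return (dominant category, number of documents carrying it)."""
    by_cluster = defaultdict(Counter)
    for c, y in zip(clusters, labels):
        by_cluster[c][y] += 1
    return {c: counts.most_common(1)[0] for c, counts in by_cluster.items()}


def micro_averaged_accuracy(clusters, labels):
    """Fraction of documents whose true category is the dominant one in their cluster."""
    dominant = dominant_categories(clusters, labels)
    return sum(n for _, n in dominant.values()) / len(labels)


def is_well_balanced(clusters, labels):
    """True if every category prevails in exactly one cluster."""
    prevailing = [category for category, _ in dominant_categories(clusters, labels).values()]
    return sorted(prevailing) == sorted(set(labels))
```

When a clustering is well-balanced, the dominant-category labeling is a one-to-one mapping between clusters and categories, so the resulting accuracy is computed exactly as a classifier's accuracy would be.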

The system is evaluated on five users who are familiar with the task of document
clustering. The idea behind interactive clustering was explained to the users, and they
were provided with a brief description of the dataset. They were given a list of 563 words that
appeared in 50 < n < 500 documents in our dataset. The users proceeded as
described in Section 5.3.2. Also, we construct an oracle as follows: for each category t
we select the 25 most frequent words that belong to a given list of sentimental words² and
whose distribution over the categories has a peak at t. Unlike human users, the oracle
does  not  provide  feedback  between  clustering  iterations.  To  some  extent,  the  oracle’s 
performance  can  be  considered  as  an  upper  bound  to  results  obtained  in  practice, 
when  a  human  user  is  involved. 
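
The oracle construction can be sketched as follows. Two details are assumptions of this example rather than statements of the thesis: the distribution of a word over categories is estimated from raw occurrence counts, and "most frequent" refers to the total number of occurrences in the labeled dataset. The function name and input format are likewise illustrative.

```python
from collections import Counter, defaultdict


def oracle_seed_words(docs, labels, sentiment_words, per_category=25):
    """Pick, for each category t, the most frequent sentiment words that peak at t.

    docs:   list of token lists; labels: the true category of each document.
    """
    # word -> category -> number of occurrences (restricted to the sentiment list)
    counts = defaultdict(Counter)
    for tokens, y in zip(docs, labels):
        for w in tokens:
            if w in sentiment_words:
                counts[w][y] += 1

    # Group words by the category at which their distribution peaks.
    candidates = defaultdict(list)
    for w, by_category in counts.items():
        peak = max(by_category, key=by_category.get)
        candidates[peak].append((sum(by_category.values()), w))

    # Keep the most frequent words per category (ties broken alphabetically).
    return {t: [w for _, w in sorted(ws, key=lambda x: (-x[0], x[1]))[:per_category]]
            for t, ws in candidates.items()}
```

Because the oracle sees the true categories, its seed words are cleaner than anything a user could supply, which is why its performance is treated as an approximate upper bound.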

We  perform  a  simulated  classification  (SC)  experiment  analogous  to  the  one  of  Liu 
et  al.  [72]  (see  a  description  in  Section  5.3.1),  where  the  seed  words  are  provided  by 
our  oracle.  We  replace  an  ad-hoc  kNN-like  clustering  in  Liu  et  al.’s  implementation 
by our effective Comraf clustering, and a Naive Bayes classifier by an SVM.

² Our list of 4295 sentimental words was obtained as described in [38].

Figure 5.3. Interactive clustering by sentiment. Micro-averaged clustering accuracy
over  various  users:  (left)  over  interactive  learning  iterations  (with  original  seed  words 
only,  after  one  correction  step  and  after  two  correction  steps).  The  horizontal  line  is 
SVM  performance  (after  feature  extraction  using  a  given  list  of  sentimental  words, 
and  after  training  on  over  20K  documents);  (right)  over  categories  of  the  dataset  after 
two  correction  steps. 


5.3.6  Comparative  results 

Table  5.1  summarizes  our  observations.  Surprisingly,  with  BOW  features,  our 
Comraf  clustering  method  performs  as  well  as  an  SVM  trained  on  a  large  amount 
of  data  (Row  1).  The  good  performance  of  our  unsupervised  method  (with  BOW) 
indicates that the constructed topical clustering sheds some light on reviewers’
sentiments, which can occur when the reviewers have a consensus on certain aspects of
the  movie,  e.g.  liked  the  actors  but  disliked  the  plot  etc. 

After  feature  selection  according  to  our  list  of  sentimental  words,  the  Comraf 
achieves  a  significant  boost  in  accuracy  surpassing  the  SVM  (Row  2).  Using  an 
oracle  in  our  interactive  clustering  setup  (Row  3)  improves  the  performance  even 
further,  while  the  SC  result  (Row  4)  is  only  slightly  (but  significantly)  inferior.  These 
two  results  are  close  because  the  training  set  of  SC  is  identical  to  the  clustering 
constructed  at  the  first  iteration  of  the  Comraf  algorithm.  Since  its  size  appears  to 
be  over  3/4  of  the  entire  dataset,  there  is  almost  no  room  for  the  actual  diversity  in 
performance  of  the  two  methods. 

Figure 5.3 (left) shows the micro-averaged clustering accuracy for each user and
each iteration. For three of the five users, selection of the initial seed words is
sufficient to obtain significantly higher accuracy than the best result of the SVM. User
2 has significantly lower accuracy than the baseline to begin with, but over the two
correction steps is able to provide the necessary feedback so as to obtain an
improvement in accuracy, equaling the baseline. We found that User 2 was fairly conservative
in her assessment of terms in the beginning, marking only 26 terms, while User 1
(the  one  with  the  best  average  performance)  marked  58  terms,  23  of  which  were  in 
common  with  User  2.  User  4  reported  that  she  aggressively  removed  words  at  the 
first  correction  step,  which  caused  a  noticeable  drop  in  the  performance. 

Figure 5.3 (right) shows the accuracy per class, per user at the end of 3
iterations. User 1 and User 2 have near identical accuracies on the two extreme categories
(strongly liked and strongly disliked), but User 1 has higher accuracies on the
intermediate categories, resulting in higher micro-averaged accuracy. It is apparent from
this figure that users are able to come up with good features for the two extreme
categories, but have difficulties with the intermediate categories. The figure also shows
the  performance  of  SVM  (with  sentiment  features).  It  is  interesting  to  note  that  the 
SVM’s  pattern  of  behavior  is  almost  identical  to  the  interactive  Comraf’s. 

5.4  Summary 

In this chapter, we have shown that Comrafs can be straightforwardly applied
to semi-supervised learning by adjusting either the Comraf graph or the Comraf
inference algorithm. As the semi-supervised setup can be viewed as an instance of a
supervised setup, we can state that Comrafs are applicable to the entire
spectrum of machine learning tasks.

On the task of semi-supervised clustering, we showed that Comraf models
outperform a popular constrained optimization method. We also showed that Comraf
models are very robust to noise in data labels. Our preliminary results
demonstrate applicability of the Comraf framework to transfer learning, which has a variety
of  interesting  applications. 

We  also  applied  the  Comraf  framework  to  non-topical  clustering  of  documents,  by 
introducing  the  interactive  clustering  model.  We  showed  that  interactive  clustering, 
which is a semi-supervised version of an unsupervised clustering scheme, can
potentially outperform one of the best supervised learning methods (SVM), trained on a
large amount of labeled data. This result raises an important question that has not
been widely addressed in the machine learning literature: for a particular unlabeled
dataset, would it be more beneficial to train a supervised model on similar, yet
different, data (e.g., train a classifier on reviews of movie A and apply it to reviews of
movie B), or would it be better to construct a clustering model that takes advantage
of  some  limited  knowledge  on  the  unlabeled  data,  as  provided  by  the  user?  It  appears 
that  the  former  approach  is  quite  popular.  In  this  chapter  we  provided  evidence  that 
the  latter  approach  can  be  more  effective. 

CHAPTER 6


COMRAFS  FOR  ONE-CLASS  CLUSTERING 


As  we  discussed  in  Section  3.4,  each  Comraf  model  is  a  trinity  of  a  Comraf  graph 
G, an objective function that is factored over G, and an inference procedure for
optimizing this objective function. So far, we have experimented with various Comraf
graphs (with or without observed nodes) for multi-modal clustering, where the
objective is a sum of pairwise Mutual Information terms (3.1), and the inference procedure
is a variant of MDC (see Section 4.3). Exploring the variety of Comraf graphs led
us to proposing models for email clustering (Section 4.6), clustering scientific papers
(Section 4.8), document clustering by genre (Section 4.9), semi-supervised
clustering (Section 5.1), and clustering with transfer learning (Section 5.2). In Section 5.3,
however, we went beyond this scope and proposed an enhancement to the MDC
inference procedure that led to an interactive clustering model. Note that the interactive
clustering model exploits the same Comraf graph and objective function as other
multi-modal clustering methods we proposed. This chapter goes further in
investigating the role of the objective function in Comrafs. Specifically, we focus on the
problem  of  constructing  an  objective  function  that  best  suits  a  particular  application. 

We address this problem on a representative task of one-class clustering, which
is the task of identifying the most coherent subset of documents (the core) from a
given pool of documents. This pool can be generated by a search engine (as a set
of documents retrieved on a given query); also, this pool can be an email Inbox, a
repository of scientific papers etc. One-class clustering is a technically simpler task
than the general (multi-class) clustering: on a given dataset, a binary (as opposed
to a k-ary) predicate is constructed that answers the question of whether or not a
data instance belongs to the core. This simplicity allows for a theoretical analysis of
optimality of the one-class clustering method proposed.

Similar  to  many  other  unsupervised  learning  problems,  the  problem  of  one-class 
clustering is generally ill-posed, as one can argue that the shortest document in a
collection satisfies the criterion of being the most coherent subset. We resolve that issue
by  introducing  a  parameter  k,  which  is  the  number  of  documents  in  the  core  subset. 
This  parameter  is  analogous  to  the  number  of  clusters  in  (multi-class)  clustering,  the 
number  of  outliers  [105]  or  the  radius  of  Bregmanian  ball  [28]  in  other  formulations 
of  one-class  problems. 

Note  that  formally  the  problem  of  one-class  clustering  is  a  special  case  of  the 
general, multi-class clustering: one-class clustering is a problem of constructing
$n - k + 1$ clusters of n data instances, where one cluster is of size k and all the others
are singletons. However, since explicit modeling of singleton clusters appears to be
useless, from the practical point of view the two problems become different: methods
applicable to one-class clustering are generally inapplicable to multi-class clustering
and vice versa. Also note that the problem of one-class clustering is a complement to
an  unsupervised  formulation  of  the  outlier  detection  problem  [1,  105]:  once  the  core 
cluster  is  constructed,  all  the  non-core  data  instances  are  considered  outliers. 
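
The formal correspondence between one-class and multi-class clustering can be checked on a toy example. The sketch below (with a hypothetical helper name) builds the partition induced by a core subset: the core itself plus one singleton per remaining instance, giving n - k + 1 clusters in total; the singletons are exactly the outliers.

```python
def one_class_partition(items, core):
    """Return the multi-class partition induced by a core subset:
    the core as one cluster, followed by a singleton cluster per non-core item."""
    core = set(core)
    return [core] + [{x} for x in items if x not in core]
```

For n = 10 instances and a core of size k = 3, the function returns 10 - 3 + 1 = 8 clusters, seven of which are singletons.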

Speaking in terms of Comraf models, for one-class clustering we define
combinatorial random variables over all the possible subsets of a modality (or, in other
words, over its powerset). Recall that for multi-class clustering we defined
combinatorial random variables over all the possible partitionings of a modality. Although
the Comraf graph’s layout appears to be the same for both tasks, the one-class
clustering objective function is different from that of multi-class clustering, and so is the
inference procedure. In this chapter, we construct step-by-step the objective function
and inference procedure, starting from an artificially simplified case and ending with
the  real-world  application. 

Our  working  assumption  throughout  this  chapter  is  that  the  core  documents  share 
a (relatively) small lexicon, while the remaining documents (the noise) do not have
much  in  common  (i.e.  they  are  randomly  drawn  from  the  pool  of  all  existing  documents 
written  in  the  English  language).  Our  methods,  however,  will  work  equally  well  in 
situations  when  the  noise  has  some  structure,  meaning  that  some  non-core  documents 
share  their  topics. 

We  describe  the  simplest  Comraf  model  with  only  two  modalities:  documents 
and  words.  Despite  its  simplicity,  this  setup  allows  for  three  different  approaches  to 
one-class  clustering  of  documents: 

• Identify the shared lexicon (the subset of relevant words), i.e. solve the
one-class clustering problem for words. A document will then be considered a part
of  the  core  if  it  contains  enough  relevant  words.  We  describe  this  setup  in 
Section  6.2.  The  fundamental  question  we  answer  in  that  section  is  whether  or 
not  the  subset  of  relevant  words  can  be  identified  in  document  collections  of 
feasible  size.  We  show  analytically  that,  under  some  simplifying  assumptions, 
the  subset  of  relevant  words  can  be  optimally  identified  in  document  collections 
of log-linear size (in the size of the vocabulary).

•  Directly  identify  the  core  documents,  based  on  their  distributions  over  words. 
This setup is the focus of Section 6.3. In that section, we propose our
information-theoretic objective function. We derive a simple uni-modal
algorithm for optimizing this objective. We show that the proposed algorithm is
optimal  under  the  assumptions  imposed  in  Section  6.2.  We  then  relax  these 
assumptions  and  adjust  our  objective  function  to  the  real-world  case. 


•  Perform  one-class  co-clustering  (OCCC),  while  simultaneously  identifying  the 
subset  of  relevant  words  and  the  subset  of  core  documents  (see  Section  6.4).  We 
generalize  the  algorithm  proposed  in  Section  6.3  to  the  bi-modal  setup.  The 
resulting  OCCC  algorithm  significantly  outperforms  the  uni-modal  one. 
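
To illustrate the first of the three approaches above, the sketch below marks a document as part of the core if it contains at least a given number of distinct relevant words. The threshold `min_relevant` is a hypothetical parameter introduced for this example only; the text merely requires that a document contain "enough" relevant words, and how the relevant words are identified is the subject of Section 6.2.

```python
def core_documents(docs, relevant_words, min_relevant):
    """Return ids of documents that contain at least `min_relevant` distinct relevant words.

    docs: dict mapping doc_id -> list of tokens (assumed input format).
    """
    return [doc_id for doc_id, tokens in docs.items()
            if len(set(tokens) & relevant_words) >= min_relevant]
```

An alternative under the same assumptions would be to rank documents by their number of relevant words and keep the top k, where k is the core size parameter introduced earlier in this chapter.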

In  Section  6.5,  we  propose  another,  probabilistic  objective  function  for  our  task:  the 
likelihood  that  a  document  belongs  to  the  core.  Inspired  by  Huang  and  Mitchell  [53], 
we  apply  an  EM  inference  algorithm  to  the  resulting  model. 

We evaluate our information-theoretic and probabilistic models on two
applications: (a) Web appearance disambiguation (see Section 4.7), where our methods
outperform the algorithm proposed in [13]; and (b) re-ranking information retrieval
results [65, 36], where we significantly improve on the accuracy of Google’s original
ranked lists, as well as on that of the one-class (unsupervised) SVM and the one-class
Information Bottleneck [28].
Note that our models can also be applied to other real-world tasks, e.g. to spam
detection, news filtering, image retrieval, and basically to any task where a common
subset  of  features  can  be  identified  in  a  subset  of  data  instances. 

6.1  Related  work 

Many previously proposed one-class clustering methods (see [105, 28, 50], and
references therein) are vector-space methods, where the goal is to find a convex body
of small volume that contains as many data instances as possible. Although binary
vector-space methods have proven very effective in the text domain,
one-class vector-space methods are problematic. In binary methods, the decision
boundary is linear (with or without applying the kernel trick [29]). In (vector-space)
one-class methods, however, the boundaries are essentially elliptic, which is unnatural
in the high-dimensional text domain: core documents tend to lie on a lower-dimensional
manifold (see [68]), while elliptic boundaries tend to capture too much
space around it.


An alternative solution suggested in [13] (and discussed in Section 4.7) is to simulate
one-class clustering in text by first applying traditional multi-class clustering,
after which one of the clusters is chosen. Intuitively, this approach makes a wrong
design choice: structure is artificially forced on the space of non-core documents, which
may not have any underlying structure. The models described in this chapter, in
contrast, achieve the main goal of one-class clustering (identifying the most coherent
subset of objects) without imposing structural or topological constraints.

Our one-class clustering models have interesting cross-links with models applied to
other Information Retrieval tasks. For example, a model similar to our
information-theoretic one-class clustering is proposed by Zhou and Croft [115] for query
performance prediction. Tao and Zhai [104] describe a pseudo-relevance feedback model,
which is similar to our probabilistic one-class clustering (see discussion in Section 6.6.2).
These types of cross-links are common when the models are general enough and
relatively simple. In this work we pay particular attention to the simplicity of our
models, such that they are amenable to theoretical analysis as well as to efficient
implementation.

6.2  One-class  clustering  of  words 

We are given a dataset $\mathcal{D}$ of $n$ documents, each of which is represented as a
vector of words, with no importance given to their order (i.e., a bag-of-words). We assume
that $\mathcal{D}$ has a core $\mathcal{D}_k$ of $k$ documents written on one topic, while
the remaining $(n - k)$ documents are noise. Let $\mathcal{R}$ be the lexicon of relevant
words and $\mathcal{G} \supset \mathcal{R}$ be the general lexicon of $\mathcal{D}$ (i.e.
all distinct words of $\mathcal{D}$). Let us denote by $m = |\mathcal{G}|$ and
$m_r = |\mathcal{R}|$ the sizes of the two lexicons, where $m_r \ll m$. Assuming that the
core is not too small ($k/n \gg 0$), our intuition is that a word belongs to $\mathcal{R}$
if it is used more frequently in $\mathcal{D}$ than it would be used in general English.
For example, many occurrences of the words "reinforcement", "regression", "classifier" in
$\mathcal{D}$ indicate that they are relevant, as the probability of observing the same
frequency of these words in non-core documents is very low. Our first task is to
determine which words belong to $\mathcal{R}$.

Figure 6.1. (left) The simplest generative model; (right) Latent Topic/Background
model (Section 6.5).

We attempt to solve this problem by introducing a simple generative model of
documents (see the left panel of Figure 6.1). Given a dataset $\mathcal{D}$ of size $n$,
for each word token in every document, we first decide whether it is drawn from a
distribution $P_r(W)$ over the set $\mathcal{R}$ of relevant words, or from a distribution
$P_g(W)$ over the set $\mathcal{G}$ of all words in $\mathcal{D}$, and then we choose the
word $w$ accordingly. Both $P_r(W)$ and $P_g(W)$ are multinomial, where the former has a
much smaller support. Note that in our model, for each word token $w$, the decision
whether it is drawn from $P_r(W)$ or from $P_g(W)$ is made independently of the rest of
the model, and thus we can think of the dataset $\mathcal{D}$ as a single document of
length $N = n|d|$ (here and in the next section, we assume that all the documents are of
the same length $|d|$).

To make the following theoretical analysis easier, let us assume (quite unrealistically)
that distributions $P_r(W)$ and $P_g(W)$ are uniform rather than multinomial. In
Section 6.3.1 we relax this assumption by flattening multinomials using a correction
term. Under the uniformity assumption, an algorithm for identifying relevant words
is straightforward: obtain a sample of size $N$ and choose words with counts above
a certain threshold to be in $\mathcal{R}$ (see the illustration in the left panel of
Figure 6.2). The major drawback of this algorithm is that we need to know the exact value
of the threshold (an estimate is not enough here). An alternative algorithm would be:
obtain a sample of size $N$, sort words in decreasing order of their counts and choose the
first $m_r$ words to be in $\mathcal{R}$. Clearly, the two algorithms are asymptotically
equivalent (they identify the same set $\mathcal{R}$ if the sample size $N$ is large
enough).

Figure 6.2. An illustration of possible distributions of word counts in one-class
clustering: (left) uniform case; (right) multinomial case. Words whose counts are
above the threshold are considered relevant. Note that in the multinomial case counts
of some relevant words can be lower than counts of non-relevant words.

An important question is how large the sample size $N$ should be so that the
sets of relevant and non-relevant words are separable. For instance, if $N = O(m^2)$
samples are required, the algorithm described above will be infeasible in any real-world
case. In the following theorem, we prove that a log-linear sample size is enough. Let
us first introduce some notation. For a document $d_i$ and a word token $w_{ij}$, let
$\pi$ be the probability of drawing from the pool of relevant words $\mathcal{R}$, that
is: $P(Z_{ij} = 1) = \pi$. Let $p_w = m_r / m = |\mathcal{R}| / |\mathcal{G}|$ be the
fraction of relevant words in the dataset's vocabulary.

Theorem 6.2.1 To determine the set $\mathcal{R}$ with probability $1 - \delta$, we need
at most

$$N = 16 \, \frac{m}{\pi} \ln \frac{m}{\delta} \qquad (6.1)$$

samples, under a (weak) constraint of $p_w < 2\pi$.

The proof of this theorem is relatively straightforward: it involves an application
of the Chernoff bound and the union bound. We prove this theorem in Appendix A.
Now, under the uniformity assumption and the conditions imposed in Theorem 6.2.1,
we can identify the set $\mathcal{R}$ of relevant words with arbitrarily high
probability. The relevance of a document is then determined by the cumulative relevance
of words occurring in the document. Consequently, the core $\mathcal{D}_k$ will consist of
$k$ documents, each of which contains more words from $\mathcal{R}$ than any document
from $\mathcal{D} \setminus \mathcal{D}_k$.

Figure 6.3. The accuracy (as defined in Section 6.6, averaged over 100 independent
runs) of identifying $\mathcal{R}$ in a simulation of the generative process, over various
values of the constant from Equation (6.1) for the sample size $N$. In Equation (6.1), the
value of this constant is set to 16. Here we show that a value of 2 is enough in
practice.

We simulated the generative process for various values of $\pi$ and $p_w$. We saw that
in practice far fewer samples were required to identify the set $\mathcal{R}$
with 100% accuracy. In Equation (6.1), the constant in calculating $N$ is set to 16.
We tuned the value of this constant, and found that a value of 2 is generally
enough to perfectly identify $\mathcal{R}$. Figure 6.3 outlines some results on synthetic
data that has similar characteristics to our WAD dataset (see Section 4.7.3): we choose
$m = 12000$, $\pi = p_w = 0.2$, and $\delta = 0.01$. For $N = 330{,}000$, which is the
size of the WAD dataset, we obtain 98.5% accuracy. This implies that if words in text
datasets were indeed distributed uniformly, the one-class clustering problem would be easy.
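The simulation described above can be sketched in a few lines of Python. This is only an illustrative sketch, not the code used for Figure 6.3: the function name, its arguments and the labeling of words 0, ..., m_r - 1 as relevant are assumptions made here, and the sample size is assumed to have the form N = c (m/π) ln(m/δ), with the constant c replacing the 16 of Theorem 6.2.1.

```python
import numpy as np


def simulate_word_recovery(m=2000, p_w=0.2, pi=0.2, c=2.0, delta=0.01, seed=0):
    """Simulate the uniform generative model and recover the relevant lexicon.

    Words 0 .. m_r-1 play the role of the relevant lexicon R (an arbitrary
    labeling chosen for this sketch); all m words form the general lexicon G.
    Returns the number of sampled tokens and the recovery accuracy.
    """
    rng = np.random.default_rng(seed)
    m_r = int(round(p_w * m))
    # Assumed sample-size formula: N = c * (m / pi) * ln(m / delta).
    n_tokens = int(np.ceil(c * (m / pi) * np.log(m / delta)))

    # Each token is drawn from the relevant pool with probability pi,
    # otherwise from the general pool; both pools are uniform.
    n_relevant_draws = int(rng.binomial(n_tokens, pi))
    relevant_draws = rng.integers(0, m_r, size=n_relevant_draws)
    general_draws = rng.integers(0, m, size=n_tokens - n_relevant_draws)

    counts = (np.bincount(relevant_draws, minlength=m)
              + np.bincount(general_draws, minlength=m))

    # Take the m_r most frequent words as the estimate of R.
    selected = np.argsort(-counts, kind="stable")[:m_r]
    # With |selected| = m_r, precision equals recall (accuracy as in Section 6.6).
    accuracy = float(np.mean(selected < m_r))
    return n_tokens, accuracy
```

Calling `simulate_word_recovery(c=...)` for a range of constants gives a rough analogue of the experiment, under the stated assumptions.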

6.3  Min-Entropy  algorithm  for  one-class  clustering  in  text 

Obviously, the trivial one-class clustering algorithm from Section 6.2 above is
applicable only under the restrictive uniformity assumption. Sticking to the uniformity
assumption for now, we propose an alternative formal criterion, which in Section 6.3.1
will be adjusted to the practical case. Based on this criterion, we design an algorithm
that directly identifies the core, and show that this algorithm is optimal under the
uniformity assumption. Let us define a word entropy of the dataset $\mathcal{D}$ as:

$$H(W) = H(P(W)) = -\sum_{w \in \mathcal{G}} P(w) \log P(w)
       = -\sum_{d \in \mathcal{D},\, w \in \mathcal{G}} P(d,w) \log P(w), \qquad (6.2)$$

where $P$ is an empirical distribution of words in $\mathcal{D}$: given that a word $w$
occurs $N_{w \in d}$ times in a document $d$, and $N_w$ times in the entire dataset, we let
$P(d,w) = N_{w \in d} / N$ and $P(w) = \sum_d P(d,w) = N_w / N$. Define a document-word
entropy of a document $d$ as:

$$H_d(W) = -\sum_{w \in \mathcal{G}} P(d,w) \log P(w)
         = -\sum_{w \in d} P(d,w) \log P(w). \qquad (6.3)$$

Note that the word entropy (6.2) is additive: $H(W) = \sum_{d \in \mathcal{D}} H_d(W)$.
The document-word entropy $H_d(W)$ captures our intuition of a core document: documents
that mainly use frequent words have low $H_d(W)$. To see this, we factorize the joint
$P(d,w) = P(d) P(w|d)$, and assume that all documents have a uniform prior $P(d) = 1/n$.
Thus, $H_d(W)$ is (up to the constant factor $1/n$) the expectation of $-\log P(w)$
according to the word frequency $P(w|d)$ in $d$, which is small if $d$ uses a lot of
frequent words.

Based on this observation, for each subset $\mathcal{D}_k$ of size $k$, we define our
objective as $\mathcal{D}_k$'s contribution to the word entropy (6.2):

$$H_k(W) = \sum_{d \in \mathcal{D}_k} H_d(W)
         = -\sum_{d \in \mathcal{D}_k,\, w \in \mathcal{G}} P(d,w) \log P(w). \qquad (6.4)$$

We argue that the most coherent subset $\mathcal{D}_k$ is the one that minimizes this
objective. To find the most coherent $\mathcal{D}_k$, we use the following simple, greedy
Min-Entropy algorithm:

1. Sort documents according to their word entropy portion (6.3), in increasing
order.

2. Select the first $k$ documents. Eliminate all the rest.

Since our objective (6.4) is additive in documents, its global minimum is found by the
above algorithm.
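The two steps of the Min-Entropy algorithm can be sketched as follows. This is a minimal illustration rather than the original implementation; the function name `min_entropy_core` and the dense document-by-word count matrix are assumptions made for the example.

```python
import numpy as np


def min_entropy_core(counts, k):
    """Greedy Min-Entropy selection of a core of k documents.

    counts: array of shape (n_docs, n_words) holding the token counts
            N_{w in d}; this dense layout is an assumption of the sketch.
    Returns the indices of the k selected documents and all H_d(W) values.
    """
    counts = np.asarray(counts, dtype=float)
    p_dw = counts / counts.sum()          # empirical joint P(d, w)
    p_w = p_dw.sum(axis=0)                # empirical marginal P(w)

    # -log P(w); words that never occur contribute nothing to any H_d.
    with np.errstate(divide="ignore"):
        neg_log_p = np.where(p_w > 0, -np.log(p_w), 0.0)

    h_d = p_dw @ neg_log_p                # document-word entropy, Eq. (6.3)
    order = np.argsort(h_d, kind="stable")  # step 1: increasing H_d(W)
    return order[:k], h_d                 # step 2: keep the first k documents
```

Because the objective is a sum of per-document terms, a single sort suffices; no iterative search over subsets is needed.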

We now show that this algorithm is optimal under the uniformity assumption.
Indeed, if the dataset $\mathcal{D}$ is large enough, then according to Theorem 6.2.1
(with high probability) any relevant word $w$ has a lower word-score $-\log P(w)$ than any
non-relevant word, because relevant words are more frequent in $\mathcal{D}$. Since we
assume that all documents are of the same length ($|d|$ is constant), the Min-Entropy
algorithm chooses documents that contain more relevant words than any other document in
the dataset. But this is exactly the main property of the core, as discussed in
Section 6.2. Therefore, the Min-Entropy algorithm identifies the core. We summarize
this observation in the following theorem:

Theorem 6.3.1 If the dataset $\mathcal{D}$ is large enough, then with high probability
over datasets, the Min-Entropy algorithm is optimal for the one-class clustering problem
under the uniformity assumption.

6.3.1  Relaxation  of  the  uniformity  assumption 

In practice, distributions $P_r(W)$ and $P_g(W)$ are multinomial rather than uniform
(see the illustration in the right panel of Figure 6.2). We adapt the theory presented
above to this case by exploiting the fact that the entropy of a distribution can be
viewed (up to sign and an additive constant) as the Kullback-Leibler (KL) divergence
between this distribution and a uniform one. In place of the entropy from Equation (6.2),
we propose to use KL divergence:

$$KL(P \| Q) = \sum_{w \in \mathcal{G}} P(w) \log \frac{P(w)}{Q(w)}, \qquad (6.5)$$

where $Q(w)$ is an estimate of the true probability of a word occurrence in the English
language. This modification can be thought of as an adjustment of the empirical word
distribution in $\mathcal{D}$ to the uniform one. An algorithm analogous to Min-Entropy
aims at finding a subset $\mathcal{D}_k$ that maximizes its portion in (6.5):

$$KL_k(P \| Q) = \sum_{d \in \mathcal{D}_k,\, w \in \mathcal{G}} P(d,w) \log \frac{P(w)}{Q(w)}. \qquad (6.6)$$

Thus, we identify the core $\mathcal{D}_k$ as a subset of documents containing many words
that occur in $\mathcal{D}$ more frequently than in general English. Following [94, 13],
we exploit Web counts of words: we estimate $Q(w)$ as a normalized count of $w$ in the
Web. The Web counts are obtained using the Google API.
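As a sanity check on the move from entropy to KL divergence, the following short derivation shows what the KL portion of a subset of $k$ documents becomes when the reference distribution is uniform. It is a sketch under two assumptions: a uniform reference $Q(w) = 1/m$, and documents of equal length so that $P(d) = 1/n$, as in Section 6.2.

```latex
\begin{align*}
\sum_{d \in \mathcal{D}_k,\, w \in \mathcal{G}} P(d,w) \log \frac{P(w)}{1/m}
  &= \sum_{d \in \mathcal{D}_k,\, w \in \mathcal{G}} P(d,w) \log P(w)
     \;+\; \log m \sum_{d \in \mathcal{D}_k} P(d) \\
  &= -H_k(W) + \frac{k}{n} \log m .
\end{align*}
```

The second term does not depend on which $k$ documents are chosen, so under these assumptions maximizing the KL portion with a uniform reference is the same as minimizing $H_k(W)$. Replacing the uniform reference by an estimate of general-English word frequencies is what distinguishes the practical objective from the Min-Entropy one.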

6.4  One-class  co-clustering  (OCCC) 

As discussed in Section 4.1, co-clustering is a special case of multi-modal clustering,
where only two interacting modalities are considered. In the text domain,
co-clustering usually implies clustering documents $D$ and words $W$, either
sequentially [99] or iteratively [39].

In the one-class clustering case, the co-clustering framework is interpreted as
constructing one cluster of core documents, together with one cluster of relevant words.
The co-clustering idea has special importance for one-class clustering, as we want to
diminish the influence of non-relevant words on the process of selecting core
documents. In many real-world cases, where $|\mathcal{R}| \ll |\mathcal{G}|$, the mass of
non-relevant words in the mixture $P(W)$ is dominant, while only relevant words are
responsible for a document being relevant. Reducing this mass is the goal of one-class
co-clustering. By examining Equation (6.6), it is natural to define a score of word
relevance as:

$$s(w) = \log \frac{P(w)}{Q(w)} \qquad (6.7)$$

such that our objective function (6.6) is the weighted average of these scores. For
co-clustering we propose to replace the objective (6.6) with the following:

$$KL'(P \| Q) = \sum_{d \in \mathcal{D}_k,\, w \in \mathcal{R}} P'(d,w) \log \frac{P(w)}{Q(w)}, \qquad (6.8)$$

where $P'(d,w) = P(d,w) / \left( \sum_{w' \in \mathcal{R}} P(d,w') \right)$ is a joint
distribution of documents and (only) relevant words. Because of the re-normalization
introduced, it is not obvious how to find the global optimum of the objective (6.8). We
thus propose to approximate it using a simple, sequential One-Class Co-Clustering (OCCC)
algorithm: we first build a cluster of relevant words, based on which we build a cluster
of core documents,¹ as follows:


1. Sort words according to their scores from Equation (6.7), in decreasing order.

2. Select a subset $\mathcal{R}$ of the first $m_r$ words.

3. Represent documents as bags-of-words over $\mathcal{R}$ (delete counts of all words
from $\mathcal{G} \setminus \mathcal{R}$).

4. For each document $d$, calculate its portion in Equation (6.8):

$$KL_d(P \| Q) = \sum_{w \in \mathcal{R}} P'(d,w) \log \frac{P(w)}{Q(w)}
              = \sum_{w \in \mathcal{R} \cap d} P'(d,w) \log \frac{P(w)}{Q(w)}. \qquad (6.9)$$

5. Sort documents according to their scores from Equation (6.9), in decreasing
order.

6. Select a subset $\mathcal{D}_k$ of the first $k$ documents.


Despite its simplicity, the OCCC algorithm shows excellent results on real-world data
(see Section 6.6). The algorithm's complexity is particularly appealing: $O(N)$, where
$N$ is the number of word tokens in $\mathcal{D}$.
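The six OCCC steps can be sketched as below. This is an illustrative sketch, not the thesis implementation: the function name `occc`, the dense count matrix and the vector `q` of background probabilities are assumptions. The re-normalization is read here as a per-document sum over relevant words; if the intended normalizer also runs over documents, `row_sums` would be replaced by the single total `sub.sum()`. For brevity the sketch uses full sorts, so it does not reproduce the linear complexity quoted above.

```python
import numpy as np


def occc(counts, q, m_r, k):
    """Sequential one-class co-clustering: words first, then documents.

    counts: (n_docs, n_words) token counts; every word is assumed to occur
            at least once in the dataset, so that P(w) > 0.
    q:      (n_words,) background probabilities Q(w), assumed positive.
    Returns (indices of relevant words, indices of core documents, doc scores).
    """
    counts = np.asarray(counts, dtype=float)
    q = np.asarray(q, dtype=float)
    p_w = counts.sum(axis=0) / counts.sum()
    if np.any(p_w <= 0) or np.any(q <= 0):
        raise ValueError("P(w) and Q(w) must be strictly positive")

    s = np.log(p_w) - np.log(q)                        # word scores, Eq. (6.7)
    relevant = np.argsort(-s, kind="stable")[:m_r]     # steps 1-2

    sub = counts[:, relevant]                          # step 3: restrict to R
    row_sums = sub.sum(axis=1, keepdims=True)
    # P'(d, w): counts renormalized over the relevant words of each document
    # (documents without relevant words get an all-zero row and score 0).
    p_prime = np.divide(sub, row_sums, out=np.zeros_like(sub), where=row_sums > 0)

    doc_scores = p_prime @ s[relevant]                 # step 4, Eq. (6.9)
    core = np.argsort(-doc_scores, kind="stable")[:k]  # steps 5-6
    return relevant, core, doc_scores
```

The word cluster size `m_r` is left as an input here; choosing it is a separate question.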


¹ In this simplest algorithm, word clustering is analogous to feature selection, in which
selected features correspond to only one class of the data. In more complex algorithms
though, this analogy will be less obvious.

6.4.1 Heuristic for choosing the size of the word cluster

The choice of $m_r$ can be crucial. While not proposing a comprehensive method
for choosing $m_r$, we propose a useful heuristic. The distribution of scores $s(w)$ for
relevant words can be modeled by a normal distribution with mean $\mu_r > 0$ and
variance $\sigma_r^2$. Analogously, the distribution of word scores for non-relevant
words is modeled by a normal distribution with mean $\mu_{nr} = 0$ and variance
$\sigma_{nr}^2$. We assume that all the words with negative scores are non-relevant. Since
the normal distribution is symmetric, we further assume that the number of non-relevant
words with negative scores equals the number of non-relevant words with positive scores.
Therefore, our estimate of the total number of non-relevant words is twice the number of
words with negative scores, and the number of relevant words can thus be estimated as
$m_r = m - 2 \cdot \#\{\text{words with negative scores}\}$.
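The heuristic amounts to a one-line computation, sketched below. The function name is hypothetical, and clamping the estimate at zero is a safeguard added for this example (the text does not discuss the case where more than half of the words have negative scores).

```python
def estimate_word_cluster_size(scores):
    """Estimate m_r = m - 2 * #{words with negative scores}.

    scores: iterable of word scores s(w), one per word of the vocabulary.
    The result is clamped at 0, which is an assumption of this sketch.
    """
    scores = list(scores)
    m = len(scores)
    n_negative = sum(1 for s in scores if s < 0)
    return max(m - 2 * n_negative, 0)
```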

6.5  The  Latent  Topic/Background  (LTB)  model 

Here we revise our generative model from Section 6.2 and propose another one-class
clustering algorithm based on probabilistic inference. Our new generative model
is shown in the right panel of Figure 6.1. For each document $d_i$, $Y_i$ is a Bernoulli
random variable where $Y_i = 1$ corresponds to $d_i$ being relevant. For each word token
$w_{ij}$, $Z_{ij}$ is a Bernoulli random variable where $Z_{ij} = 1$ means that $w_{ij}$
is sampled from the multinomial distribution $P_r(W)$ over relevant words; otherwise it is
sampled from the general multinomial distribution $P_g(W)$ over all words in
$\mathcal{D}$.

Following [53], we admit that not all words in a relevant document should be
relevant. In our model, if a document belongs to the core ($Y_i = 1$), for each of its
words we make a decision (based on $Z_{ij}$) whether it is sampled from $P_r(W)$ or
$P_g(W)$. However, if a document does not belong to the core ($Y_i = 0$), each of its
words is sampled from $P_g(W)$, i.e. $P(Z_{ij} = 0 \mid Y_i = 0) = 1$.

We use the Expectation-Maximization (EM) algorithm to learn parameters of
our model from the dataset. We now describe the model parameters $\Theta$. First, the
probability of a document belonging to the core is denoted by $P(Y_i = 1) = k/n = p_d$
(this parameter is fixed and will not be inferred from data). Second, for each document
$d_i$, we maintain the probability of each of its words being relevant (given that the
document is relevant), $P(Z_{ij} = 1 \mid Y_i = 1) = \pi_i$ for $i = 1, \ldots, n$. Third,
for each word $w_l$, $l = 1, \ldots, m$, we let $P(w_l \mid Z_l = 1) = p_r(w_l)$ and
$P(w_l \mid Z_l = 0) = p_g(w_l)$. The overall number of parameters is $n + 2m + 1$, one of
which ($p_d$) is preset. The dataset likelihood is then:

$$P(\mathcal{D}) = \prod_{i=1}^{n} \left[ p_d \, P(d_i \mid Y_i = 1) + (1 - p_d) \, P(d_i \mid Y_i = 0) \right]$$

$$= \prod_{i=1}^{n} \left[ p_d \prod_{j=1}^{|d_i|} \left[ \pi_i \, p_r(w_{ij}) + (1 - \pi_i) \, p_g(w_{ij}) \right]
    + (1 - p_d) \prod_{j=1}^{|d_i|} p_g(w_{ij}) \right].$$

At each iteration $t$ of the EM algorithm, we first perform the E-step, where we
compute the posterior distribution of hidden variables $\{Y_i\}$ and $\{Z_{ij}\}$ given
the current parameter values $\Theta^t$ and the data $\mathcal{D}$. Then, at the M-step,
we compute the new parameter values $\Theta^{t+1}$ that maximize the model log-likelihood
given $\Theta^t$, $\mathcal{D}$ and the posterior distribution.
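For concreteness, the E-step posteriors can be written out explicitly. The exact form used in this work is given in Appendix B, which is not reproduced here; the expressions below are simply what Bayes' rule yields for the generative model described in this section (a document is in the core with prior $p_d$, and a word of a core document is relevant with prior $\pi_i$), and should be read as a sketch.

```latex
\begin{align*}
P(Y_i = 1 \mid d_i; \Theta^t) &=
  \frac{p_d \prod_{j=1}^{|d_i|} \left[\pi_i\, p_r(w_{ij}) + (1-\pi_i)\, p_g(w_{ij})\right]}
       {p_d \prod_{j=1}^{|d_i|} \left[\pi_i\, p_r(w_{ij}) + (1-\pi_i)\, p_g(w_{ij})\right]
        + (1-p_d) \prod_{j=1}^{|d_i|} p_g(w_{ij})}, \\[4pt]
P(Z_{ij} = 1 \mid Y_i = 1, w_{ij}; \Theta^t) &=
  \frac{\pi_i\, p_r(w_{ij})}{\pi_i\, p_r(w_{ij}) + (1-\pi_i)\, p_g(w_{ij})}.
\end{align*}
```

Here $\pi_i$, $p_r$ and $p_g$ denote the current parameter values in $\Theta^t$. In practice the products over word tokens are best evaluated in log space to avoid numerical underflow on long documents.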

The initialization step is crucial for the EM algorithm. Our pilot experimentation
showed that if distributions $P_r(W)$ and $P_g(W)$ are initialized as uniform, the EM
results are close to random. Therefore, we borrow an idea from our OCCC model
(Section 6.4) and initialize word probabilities proportional to their relevance scores
from Equation (6.7). Initialization of the $\pi_i$ parameters, which are the ratio of
relevant words in relevant documents, is a problem analogous to determining the word
cluster size in OCCC (see Section 6.4.1). We do not propose an optimal way to initialize
the $\pi_i$ parameters. However, as we show later in Section 6.6, the EM algorithm appears
to be quite robust to the choice of $\pi_i$: namely, $\pi_i = 0.5$ (or close to that)
leads to a good result.

Input:
    $\mathcal{D}$ - the dataset
    $s(w_l)$ - score for each word $w_l$, $l = 1, \ldots, m$, from Equation (6.7)
    $T$ - number of EM iterations

Output: Posteriors $P(Y_i = 1 \mid d_i, \Theta^T)$ for each document $d_i$, $i = 1, \ldots, n$

Initialization:
    for each document $d_i$ initialize $\pi_i^1$
    for each word $w_l$ initialize $p_r^1(w_l) = \frac{1}{S_r} \exp(s(w_l))$;
        $p_g^1(w_l) = \frac{1}{S_g} \exp(-s(w_l))$,
        where $S_r$ and $S_g$ are normalization factors

Main loop:
    For each $t = 1, \ldots, T$ do
        E-step:
            for each document $d_i$ compute $\alpha_i^t = P(Y_i = 1 \mid d_i; \Theta^t)$
            for each word token $w_{ij}$ compute
                $\beta_{ij}^t = P(Z_{ij} = 1 \mid Y_i = 1, w_{ij}; \Theta^t)$
        M-step:
            for each document $d_i$ update $\pi_i^{t+1} = \frac{1}{|d_i|} \sum_j \beta_{ij}^t$
            for each word $w_l$ update

                $$p_r^{t+1}(w_l) = \frac{\sum_i \alpha_i^t \sum_j \delta(w_{ij} = w_l) \, \beta_{ij}^t}
                                        {\sum_i \alpha_i^t \sum_j \beta_{ij}^t}$$

                $$p_g^{t+1}(w_l) = \frac{N_{w_l} - \sum_i \alpha_i^t \sum_j \delta(w_{ij} = w_l) \, \beta_{ij}^t}
                                        {N - \sum_i \alpha_i^t \sum_j \beta_{ij}^t}$$

Algorithm 4: EM algorithm for one-class clustering using the LTB model.


The EM procedure is sketched in Algorithm 4. We omit minor details; see
Appendix B for a more detailed description of the algorithm. After $T$ iterations, we sort
the documents according to $\alpha_i^T$ in decreasing order and choose the first $k$
documents to be the core. The complexity of our implementation of Algorithm 4 is $O(TN)$.
To avoid overfitting, we set $T$ to be a small number: in our experiments we fix $T = 5$.
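A compact sketch of the EM procedure is given below. It is not the implementation described in Appendix B: the function name `ltb_em`, the dense count matrix, the small probability floor and the clipping of the per-document ratios away from 0 and 1 are all assumptions added here for numerical safety. The E-step uses the posteriors that Bayes' rule gives for the generative model of this section, evaluated in log space, and documents are assumed to be non-empty.

```python
import numpy as np


def _normalize(v, floor):
    """Floor a non-negative vector (numerical safeguard) and normalize it."""
    v = np.maximum(v, floor)
    return v / v.sum()


def ltb_em(counts, scores, p_d, pi_init=0.5, n_iter=5, floor=1e-12):
    """EM for the Latent Topic/Background model (illustrative sketch).

    counts: (n_docs, n_words) token counts per document.
    scores: (n_words,) word relevance scores s(w).
    p_d:    fixed prior probability that a document belongs to the core.
    Returns (alpha, p_r, p_g, pi); alpha[i] is the posterior P(Y_i = 1 | d_i)
    computed in the last E-step.
    """
    counts = np.asarray(counts, dtype=float)
    scores = np.asarray(scores, dtype=float)
    n_docs = counts.shape[0]
    doc_len = counts.sum(axis=1)       # |d_i|
    word_total = counts.sum(axis=0)    # N_w

    # Initialization: p_r proportional to exp(s), p_g proportional to exp(-s).
    p_r = _normalize(np.exp(scores - scores.max()), floor)
    p_g = _normalize(np.exp(scores.min() - scores), floor)
    pi = np.full(n_docs, float(pi_init))
    alpha = np.full(n_docs, float(p_d))

    for _ in range(n_iter):
        # E-step: beta depends only on the document and the word identity.
        mix = pi[:, None] * p_r + (1.0 - pi[:, None]) * p_g
        beta = pi[:, None] * p_r / mix
        log_core = np.log(p_d) + (counts * np.log(mix)).sum(axis=1)
        log_noise = np.log1p(-p_d) + counts @ np.log(p_g)
        # alpha = 1 / (1 + exp(log_noise - log_core)), computed stably.
        alpha = np.exp(-np.logaddexp(0.0, log_noise - log_core))

        # M-step.
        rel = counts * beta            # expected relevant tokens per (doc, word)
        pi = np.clip(rel.sum(axis=1) / doc_len, 1e-6, 1.0 - 1e-6)
        rel_w = alpha @ rel            # sum_i alpha_i sum_j delta(w_ij = w) beta_ij
        p_r = _normalize(rel_w, floor)
        p_g = _normalize(word_total - rel_w, floor)

    return alpha, p_r, p_g, pi
```

The core is then obtained by sorting documents by the returned `alpha` in decreasing order and keeping the first k, for example with `np.argsort(-alpha)[:k]`.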


6.6  Experimentation  with  OCCC  and  LTB 

To define our evaluation criteria, let $C$ be the constructed cluster and let $C_r$
be its portion consisting of documents that actually belong to the core. Precision
is then defined as $Prec = |C_r|/|C|$, recall as $Rec = |C_r|/k$ and F-measure as
$(2 \, Prec \, Rec)/(Prec + Rec)$. In all our experiments we fix $|C| = k$, such that
precision equals recall and is then called one-class clustering accuracy, or just accuracy.

Figure 6.4. Web appearance disambiguation: (left) OCCC accuracy as a function
of the word cluster size; (right) accuracy of LTB (with the underlying EM algorithm)
over various initializations of the $\pi_i$ parameters. LTB shows a more robust behavior
than OCCC; however, LTB's maximal result (80.2%) is slightly inferior to OCCC's
(82.4%).
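The evaluation criteria defined at the start of this section translate directly into code. The helper below is a small sketch with a hypothetical name; it takes the constructed cluster and the true core as collections of document identifiers, both assumed non-empty.

```python
def one_class_scores(cluster, core):
    """Return (precision, recall, F-measure) of a constructed cluster.

    cluster: identifiers of the documents placed in the cluster C (non-empty).
    core:    identifiers of the k documents that truly form the core (non-empty).
    """
    cluster, core = set(cluster), set(core)
    hits = len(cluster & core)                 # |C_r|
    precision = hits / len(cluster)
    recall = hits / len(core)
    if hits == 0:
        return precision, recall, 0.0
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure
```

When the cluster has exactly k documents, the three values coincide, which is the quantity reported as accuracy.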


6.6.1  Web  appearance  disambiguation 

The Web appearance disambiguation (WAD) task is described in Section 4.7.
WAD is a classic one-class clustering task that was solved in that section using a
simulated one-class clustering method: multiple clusters are constructed, out of which
one cluster is then selected. Here we propose a more effective solution.

We test our methods on the WAD dataset (Section 4.7.3). The dataset consists
of 1085 pages, out of which 420 are relevant, so we apply our algorithms with
$k = 420$. At a preprocessing step, we binarize document vectors and remove
low-frequency words (both in terms of $P(w)$ and $Q(w)$). The results are summarized in
Figure 6.4. On its left panel, the x-axis corresponds to the hypothesized number of
relevant words, and the y-axis to accuracy. The best OCCC performance is obtained
with $m_r = 2200$ words: 82.4% accuracy, while the F-measure reported in Section 4.7.5
is 78.4% (on a cluster with fewer than 420 documents; its recall is only 71.3%).

As can be seen from the left panel of Figure 6.4, the OCCC performance is robust:
accuracy above 80% is obtained with a word cluster of any size in the 1000-3000
range. The heuristic from Section 6.4.1 suggests a cluster size of 1000. The right
panel of Figure 6.4 shows the LTB accuracy over various initialization values of the
$\pi_i$ parameter (the fraction of relevant words in core documents). We can infer from
this plot that LTB is even more robust to parameter initialization than OCCC: any
but very large (i.e. $\pi_i \approx 1$) values can be chosen.

 #  OCCC         LTB              #  OCCC        LTB              #  OCCC        LTB
 1  cheyer       artificial       8  mlittman    proceedings     15  gorfu       kaelbling
 2  kachites     learning         9  hardts      computational   16  billmark    andrew
 3  quickreview  cs              10  meuleau     reinforcement   17  pomdps      conference
 4  adddoc       intelligence    11  dipasquo    papers          18  ml95        markov
 5  aaai98       machine         12  shakshuki   emu             19  agentus     Stanford
 6  kaelbling    edu             13  xevil       aaai            20  megacanje   models
 7  mviews       algorithms      14  sangkyu     workshop

Table 6.1. Most highly ranked words by OCCC and LTB, on the WAD dataset.

Finally, Table 6.1 lists the top 20 words according to the models learned by OCCC
and by LTB. The OCCC algorithm sorts words according to their score $s(w)$, such
that words that often occur in the dataset but rarely on the Web are at the top of
the list. These are mostly last names or login names of researchers, venues etc. The
EM algorithm of LTB is given OCCC's ranked word list as an input to initialize
$p_r^1(w)$ and $p_g^1(w)$, which are then updated at each M-step. In the LTB column, words
are sorted by $p_r(w)$. The high quality of the LTB list is due to explaining away in
our generative model (via the $Y_i$ nodes). Still, OCCC (marginally) outperforms LTB
on this dataset: the maximal result obtained by OCCC is 82.4% accuracy, while LTB
obtains 80.2% accuracy.

6.6.2  Re-ranking  Web  retrieval  results 

Modern search engines are usually successful in identifying relevant documents for
a given general-type query. However, in most cases some of the top-ranked documents
have only a marginal relation to the query. For example, when querying Google for Beatles,
many top-ranked documents indeed talk about the quartet. However, one can also see a
document about the Apple Corps vs. Apple Computer trial (which is certainly not
about the Beatles), and some other clearly non-relevant documents.

QUERY         GOOGLE   OC-SVM   OC-IB   OCCC    LTB
Godfather     0.444    0.407    0.400   0.852   0.926
Bunker Hill   0.487    0.590    0.821   0.897   0.923
Beatles       0.400    0.457    0.571   0.629   0.771

Table 6.2. Re-ranking Web retrieval results: We compare one-class clustering
accuracy of our OCCC (with heuristic from Section 6.4.1) and LTB (initialized with
$\pi_i = 0.5$) models with the accuracy of the original Google rank lists, of one-class SVM
(OC-SVM) and of one-class Information Bottleneck (OC-IB) [28] with $\ell_2$-norm.


In  this  section,  we  leverage  the  high  quality  of  Web  retrieval  results  and  attempt  to 
improve  them  even  further.  Our  assumption  is  that  relevant  documents  are  topically 
close  to  each  other,  while  non-relevant  documents  can  be  on  any  topic.  We  notice  that 
as  soon  as  a  few  relevant  documents  appear  among  the  n  top-ranked  results,  we  can 
apply  our  one-class  clustering  methods  to  the  task  of  re-ranking  those  results,  where 
the  goal  is  to  move  relevant  documents  up  in  the  ranked  list,  while  moving  non- 
relevant  ones  down  the  list.  In  one-class  clustering,  we  identify  the  most  coherent 
subset  (i.e.  the  core)  from  a  set  of  n  documents.  Assuming  that  core  documents  are 
relevant,  while  non-core  documents  are  non-relevant,  we  re-organize  the  ranked  list 
such  that  core  documents  are  now  located  above  non-core  ones,  while  preserving  the 
initial  ordering  within  both  the  core  and  non-core  subsets. 
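The re-organization of the ranked list described above is a stable partition, which can be sketched as follows. The function name and the representation of documents by identifiers are assumptions of this example; the set of core documents would come from one of the one-class clustering methods of this chapter.

```python
def rerank(ranked_ids, core_ids):
    """Move core documents above non-core ones, keeping the original order
    within each of the two groups.

    ranked_ids: document identifiers in the original ranked order.
    core_ids:   identifiers of the documents identified as the core.
    """
    core = set(core_ids)
    core_part = [doc for doc in ranked_ids if doc in core]
    rest_part = [doc for doc in ranked_ids if doc not in core]
    return core_part + rest_part
```

For instance, `rerank(["a", "b", "c", "d", "e"], {"b", "e"})` places `"b"` and `"e"` first and leaves `"a"`, `"c"`, `"d"` in their original relative order.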

Note that the problem of one-class clustering for re-ranking Web retrieval results
is similar to the problem of pseudo-relevance feedback (see, e.g. [104]). However,
the two problems are still fundamentally different. In pseudo-relevance feedback, one
assumes that the first $k$ documents in a ranked list are relevant, and re-ranks the
rest of the ranked list based on that assumption. In one-class clustering, in contrast,
we make a weaker assumption that the $k$ relevant documents exist within the first $n$
documents in a ranked list. Our task is then to discover those $k$ documents and place
them at the top of the ranked list.

We test the resulting system on three small datasets that we created for this
chapter. Each of them contains the first 100 Google hits retrieved on a certain query,
labeled as relevant / non-relevant with regard to the major meaning of the query.
These queries are:

•  Godfather. While the word Godfather is ambiguous, a query Godfather most
probably refers to the popular movie/book²; other readings are considered
non-relevant. Among the set of 100 documents, 27 were annotated as relevant.

•  "Bunker Hill". The phrase Bunker Hill is not ambiguous, and a user who
types such a query is presumably interested in information about the Bunker
Hill battle and/or monument. However, some Bunker Hill mentions are not
directly related to the historical event, e.g. Bunker Hill Community College or
Bunker Hill Presbyterian Church. This dataset contains 39 relevant documents.

•  Beatles. The obvious reading of the query Beatles is the name of the legendary
quartet. All of the first 100 Google hits refer to the quartet; however, only 35 of
them provide information about the quartet, such as their biography or discography,
which is (almost) certainly the type of information a user expects to
retrieve on the query Beatles.

We compare our methods with two previously proposed one-class clustering
techniques: an unsupervised one-class SVM and a one-class Information Bottleneck (see [28]
for details on those methods). Our results are shown in Table 6.2; together with the
two baselines, we list the accuracies of Google’s original ranked lists, where the
first k documents are considered the core, while the remaining n − k documents are
considered the noise. Our methods clearly outperform the baselines, and LTB shows
better performance than OCCC.


2 According to imdb.com, The Godfather is the world’s most popular film to date.

6.6.3 Detecting the topic of the week

As we discussed in this chapter’s introduction, real-world data rarely consists
of a clean core and uniformly distributed noise. Usually, the noise has some structure;
namely, it may contain coherent components. In this respect, one-class clustering
can  be  used  to  detect  the  largest  coherent  component  in  a  dataset,  which  is  an  integral 
part  of  many  applications.  In  this  section,  we  solve  the  problem  of  automatically 
detecting  the  topic  of  the  week  (TW)  in  a  newswire  stream,  i.e.  detecting  all  articles 
in  a  weekly  news  roundup  that  refer  to  the  most  broadly  discussed  event. 

The  TW  detection  task  can  be  considered  as  a  subtask  of  Topic  Detection  and 
Tracking  (TDT)  [2],  and  is  closely  related  to: 

•  Generating  topic  overviews  [103].  A  topic  overview  is  a  set  of  keywords  that 
best describe the discussed topic. Using the one-class clustering terminology,
such a set is the cluster of relevant words. In our OCCC approach, we generate
both  a  subset  of  core  documents  and  a  subset  of  relevant  words.  In  LTB,  we 
rank  documents  and  words  according  to  their  likelihood  of  belonging  to  the 
core. 

• Discovering thematic changes [103, 52]. Major topics (represented both as
subsets of documents and as their descriptive words) change over time.
In  our  work,  we  deal  with  those  changes  by  discretizing  the  timeline  into  weeks. 
A  topic  that  was  most  broadly  discussed  one  week,  may  or  may  not  remain  so 
the  next  week. 

•  Quantifying  trends  [44].  The  trend  quantification  task  aims  at  discovering 
how  large  a  certain  topic  is,  without  necessarily  mapping  documents  to  topics. 
In  TW  detection,  however,  the  task  is  to  discover  which  topic  is  the  largest  one. 
Also,  trend  quantification  is  an  intrinsically  supervised  task,  while  TW  detection 
can  be  formulated  both  in  terms  of  supervised  and  unsupervised  learning. 

We evaluate the TW detection task on the TDT-5 dataset3, which consists of 250
news events spread over a time period of half a year, and 9,812 documents in English,
Arabic and Chinese (translated to English), annotated by their relationship to those
events.4 The largest event in the TDT-5 dataset (#55106, titled “Bombing in Riyadh,
Saudi Arabia”) has 1,144 relevant documents, while 66 out of the 250 events have only
one relevant document each. We split the dataset into 26 weekly chunks (to have 26
full weeks, we delete all the documents dated with the last day in the dataset, which
decreases the dataset’s size to 9,781 documents). Each chunk contains from 138 to
1,292 documents. Over each chunk, we applied our one-class clustering methods in
four setups:

•  OCCC  with  the  mr  heuristic  (from  Section  6.4.1). 

•  OCCC  with  optimal  mr.  We  unfairly  choose  the  number  mr  of  relevant  words 
such  that  the  resulting  accuracy  is  maximal.  This  setup  can  be  considered  as  the 
upper  limit  of  the  OCCC’s  performance,  which  can  be  hypothetically  achieved 
if  a  better  heuristic  for  choosing  mr  is  proposed. 

• LTB initialized with $\pi_i = 0.5$. As we show in Sections 6.6.1 and 6.6.2 above,
if the $\pi_i$ parameters are initialized with 0.5, the LTB model shows good results.

• LTB initialized with $\pi_i = p_d$. We notice a significant deviation in the core’s
size among our 26 datasets. Quite naturally, the number of relevant words in
a dataset depends on the number of core documents. For example, if the core
is only 10% of a dataset, it is unrealistic to assume that 50% of all words are
relevant. In this setup, we condition the ratio of relevant words on the ratio of
core documents.
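
The weekly chunking step described above can be sketched as follows. This is a minimal illustration, assuming that each document carries a calendar date and that weeks are counted from the earliest date in the collection; the function name and the toy records are hypothetical and are not part of the TDT-5 distribution.

```python
from collections import defaultdict
from datetime import date


def split_into_weeks(docs):
    """Group (doc_id, date) pairs into consecutive 7-day chunks.

    Documents dated with the last day in the collection are dropped first,
    mirroring the trimming step described in the text.
    """
    last_day = max(d for _, d in docs)
    kept = [(doc_id, d) for doc_id, d in docs if d != last_day]
    first_day = min(d for _, d in kept)
    chunks = defaultdict(list)
    for doc_id, d in kept:
        week = (d - first_day).days // 7  # 0-based week index (assumption)
        chunks[week].append(doc_id)
    return dict(chunks)


# Hypothetical toy records, for illustration only.
docs = [
    ("a", date(2000, 1, 1)),
    ("b", date(2000, 1, 7)),
    ("c", date(2000, 1, 8)),
    ("d", date(2000, 1, 15)),  # last day in the collection, dropped
]
chunks = split_into_weeks(docs)
```

Each resulting chunk is then treated as an independent one-class clustering dataset.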


3 http://projects.ldc.upenn.edu/TDT5/

4 We take into account only labeled documents, while ignoring unlabeled documents that can be
found in the TDT-5 data.

Figure 6.5. One-class clustering results on the “topic of the week” detection task.

One-class clustering accuracies per week are shown in Figure 6.5. These results
lead to several interesting observations. First, OCCC methods tend to outperform LTB
only on datasets where the results are quite low in general (less than 60% accuracy).
Specifically, on weeks 2, 4, 11 and 16 the LTB models demonstrate extremely poor
performance. While investigating this phenomenon, we discovered that in two of the
four cases LTB was able to construct very clean core clusters; however, those clusters
corresponded to the second largest topic rather than to the largest one. For example,
on the week-4 data, topic #55077 (“River ferry sinks on Bangladeshi river”) was

Method                                 Accuracy
OCCC with the mr heuristic             61.4 ± 4.5%
OCCC with optimal mr                   68.3 ± 3.6%
LTB initialized with $\pi_i = 0.5$     65.3 ± 7.3%
LTB initialized with $\pi_i = p_d$     68.0 ± 5.9%

Table 6.3. One-class clustering accuracy on the “topic of the week” detection
task. The accuracies are macro-averaged over the 26 weekly data chunks.
Standard error of the mean is presented after the ± sign.


discovered as the largest and the most coherent one. In that dataset, topic #55077
is represented by 20 documents, while topic #55063 (“SARS Quarantined medics in
Taiwan protest”) is represented by 27 documents, so topic #55077 is the second
largest one. Another interesting observation is that the (completely unsupervised)
LTB model can obtain very high results on some of the data chunks. For example,
on weeks 5, 8, 19, 21, 23, 24 and 25 LTB’s accuracy is above 90%, with a striking 100%
on week 23.

The one-class clustering accuracies, macro-averaged over the 26 weekly chunks,
are presented in Table 6.3. As we can see, both LTB models outperform the OCCC
variation where the mr heuristic is applied. Moreover, even the optimal choice of
mr does not cause OCCC to perform significantly better than LTB. The
dataset-dependent initialization of LTB’s $\pi_i$ parameters ($\pi_i = p_d$) appears to be preferable
over the dataset-independent one ($\pi_i = 0.5$).
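
For readers who wish to reproduce this kind of summary, the sketch below shows one way to macro-average per-chunk accuracies and to compute the standard error of the mean. It assumes the usual definition (sample standard deviation with the n − 1 denominator, divided by the square root of n); the per-week values are hypothetical and are not the accuracies behind Table 6.3.

```python
import math


def macro_average_with_sem(accuracies):
    """Return (mean, standard error of the mean) of per-chunk accuracies."""
    n = len(accuracies)
    if n < 2:
        raise ValueError("need at least two chunks to estimate the standard error")
    mean = sum(accuracies) / n
    # Unbiased sample variance (divides by n - 1); this choice is an assumption.
    variance = sum((a - mean) ** 2 for a in accuracies) / (n - 1)
    sem = math.sqrt(variance / n)
    return mean, sem


# Hypothetical per-week accuracies, for illustration only.
weekly = [0.90, 0.55, 0.70, 0.65]
mean, sem = macro_average_with_sem(weekly)
print(f"{100 * mean:.1f} ± {100 * sem:.1f}%")
```

Macro-averaging gives every weekly chunk the same weight regardless of how many documents it contains.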


6.7  Summary 

We have addressed the problem of inducing objective functions in Comraf models.
For the task of one-class clustering, we proposed an information-theoretic and a
probabilistic objective function, as well as algorithms for their optimization. The proposed
algorithms are very simple, very efficient and still surprisingly effective. More
sophisticated algorithms (e.g. better optimization of the objective function in OCCC) are
emerging. Also, since the Comraf framework allows straightforward generalization
of  OCCC  to  one-class  clustering  with  many  modalities,  it  will  be  interesting  to  see 
whether  one-class  clustering  results  can  be  improved  by  adding  more  modalities,  such 
as  author  names  or  hyperlinks. 

Our  evaluation  of  one-class  clustering  models  on  the  re-ranking  task  is  preliminary. 
It  gives  positive  signals  in  the  Web  search  case,  where  queries  are  of  the  general  type 
and  unlikely  to  be  ambiguous.  Also,  one-class  clustering  is  likely  to  be  useful  in  Topic 
Detection  and  Tracking.  However,  our  pilot  experimentation  in  the  ad-hoc  retrieval 
domain  shows  rather  negative  results.  In  ad-hoc  retrieval,  our  main  assumption  that 
the noise has no or little structure is generally wrong. For example, when querying TREC
1 and 2 data for acid rain, the majority of the 1000 retrieved documents are actually
weather reports, most probably because all the other documents in the collection are
even  less  relevant.  Since  one-class  clustering  methods  do  not  take  the  query  into 
account,  and  since  the  weather  reports’  subset  is  the  largest  and  the  most  coherent 
one  in  the  set  of  retrieved  documents,  our  re-ranking  hurts  the  ranking  results  on  that 
query.  Evaluating  one-class  clustering  methods  on  other  related  tasks  is  the  subject 
of  our  future  work. 

CHAPTER 7


IMAGE  CLUSTERING  WITH  COMRAFS 


In this chapter, we revise the Comraf clustering mechanism proposed in Chapter 4.
Based on the concept of observed combinatorial random variables (discussed
in  Chapter  5),  we  adapt  the  Comraf  model  to  the  case  where  the  data  consists  of 
both  sparse  modalities  (which  need  to  be  clustered)  and  dense  modalities  (not  to  be 
clustered).  We  also  generalize  the  Comraf  clustering  objective  function,  making  it 
more  flexible  and  adjustable  to  a  variety  of  real-world  tasks.  These  two  innovations 
finalize  the  development  of  the  Comraf  framework  toward  giving  a  comprehensive 
recipe  for  modeling  with  Comrafs,  which  we  present  in  Chapter  8. 

We focus here on multi-modal clustering of image collections, particularly of those
where images are associated with textual captions.1 Besides the caption words’ modality,
we consider visual modalities, both global (such as colors, texture) and local
(regions, blobs). For details, see Section 7.4 below. Image clustering can be a useful
component in a retrieval system [26]; it can also be a stand-alone application,
for example, for constructing semantic groups of image retrieval results [108], or for
browsing image collections [5]. Unfortunately, existing uni-modal clustering methods
often demonstrate poor performance on the image clustering task. In this chapter,
we show that by employing the multi-modal learning paradigm we can significantly
improve image clustering results.

Multi-modal  clustering  of  images  has  an  important  difference  when  compared  to 
multi-modal  clustering  of  documents.  Document  features,  such  as  words,  POS  tags 

1 A preliminary version of this work [12] was published at CVPR 2007.

etc. are situated in a discrete, finite space. Two textual features can be either identical
or  not.  Some  visual  features,  in  contrast,  are  unique.  These  are  local  features,  such 
as  interest  points  [91],  image  regions  [56]  etc.  An  affinity  metric  should  be  defined  to 
estimate  similarity  of  those  features.  We  find  at  least  two  disadvantages  in  working  in 
the  affinity  space.  First,  the  choice  of  the  affinity  metric  is  often  arbitrary.  Second,  the 
affinity  metric  is  defined  for  each  pair  of  data  points,  which  makes  the  computational 
complexity  of  related  clustering  algorithms  quadratic  in  the  best  case.  In  this  thesis, 
we  aim  at  avoiding  the  explicit  definition  of  the  affinity  metric  (see  Section  7.4.2). 

7.1  Related  work 

The  idea  of  clustering  images  using  both  low-level  image  features  and  surrounding 
text  (i.e.  grouping  together  visually  similar  and  semantically  related  images)  has 
attracted  close  attention  of  the  research  community.  Barnard  et  al.  [5]  propose  a 
generative  hierarchical  model  for  image  clustering,  in  which  every  node  generates 
words  and  blobs  based  on  the  given  probability  distributions  for  that  node.  Higher 
level  nodes  generate  more  general  terms  and  lower  level  nodes  generate  more  specific 
terms.  The  EM  algorithm  is  used  to  fit  the  model.  This  approach  can  handle  only 
two  feature  types  (words,  blobs);  to  handle  more  types,  the  model  and  the  learning 
procedure  must  be  revised. 

Cai et al. [25] cluster Web image search results using visual, textual and link
analysis. They extract text relevant to the image using a vision-based page segmentation
algorithm. First, only text and hyperlink data are used to cluster images. The resulting
clusters are clustered again using low-level image features. Loeff et al. [73] apply a
similar approach: they calculate a histogram of gradient magnitude of the pixel
values from every interest point and then cluster images using these local features with
global color histograms and surrounding text. Both Cai et al. and Loeff et al. use
spectral clustering methods (where the affinity scores for every pair of data instances
of every modality must be calculated), which are computationally infeasible in many
other multi-modal applications.

Bipartite  spectral  graph  partitioning  [34]  is  useful  for  co-clustering  two  modalities 
such  as  documents  and  words.  Gao  et  al.  [47]  extend  this  method  to  handle  one 
more  modality.  In  their  tripartite  graph  model,  nodes  are  arranged  in  three  layers: 
words,  images  and  image  features.  To  handle  more  modalities,  Gao  et  al.  [48]  propose 
another  method  that  is  most  closely  related  to  our  work:  they  organize  modalities  in 
a  star  structure  of  interrelationships,  where  a  central  modality  is  connected  to  all  the 
others.  They  treat  this  problem  as  fusion  of  multiple  pairwise  co-clustering  problems. 
Each  sub-problem  is  solved  using  the  bipartite  graph  partitioning  method. 

Our approach has a few advantages over the others. First, our method has no
practical limitation on the number of modalities as long as the pairwise interaction
data is available; the addition of a modality increases the computational complexity
only linearly. Second, our model can cluster multiple modalities while taking into
account other modalities, which do not have to be clustered. Third, our
information-theoretic clustering method does not rely on hard-to-obtain affinity matrices of
individual modalities. Instead, easily computable contingency tables of interacting
modalities are used. Overall, we propose a general framework for clustering
multimedia collections, which can be straightforwardly applied to video data, sound tracks,
hypertext etc. as well as to any of their combinations.

7.2  Multi-modal  clustering  objective,  revisited 

In  Section  4.1,  we  proposed  an  objective  function  for  multi-modal  clustering  as 
the  sum  of  pairwise  Mutual  Information  between  interacting  clusterings: 

$$\tilde{x}^{c*} = \arg\max_{\tilde{x}^c} \sum_{(\tilde{X}_i^c, \tilde{X}_{i'}^c) \in E} I(\tilde{X}_i; \tilde{X}_{i'})$$

subject to $|\tilde{X}_i| = k_i$, where $i = 1, \ldots, m$. In Section 3.3 we discussed one disadvantage
of a global objective function like that: Mutual Information terms can significantly
vary in their magnitude, depending on the support size of the corresponding variables.
Summing these terms together can cause an undesired effect of artificial preference
of some interactions over the others. A natural generalization of this objective would
be to consider a weighted linear combination of pairwise Mutual Information terms:

$$\tilde{x}^{c*} = \arg\max_{\tilde{x}^c} \sum_{(\tilde{X}_i^c, \tilde{X}_{i'}^c) \in E} \beta_{ii'} \, I(\tilde{X}_i; \tilde{X}_{i'}), \qquad (7.1)$$

where $E$ is the set of edges of the Comraf graph (the interacting pairs) and the weights
$\beta_{ii'}$ are chosen using some domain knowledge. An obvious choice
of the weights is one that brings all the Mutual Information terms to the
same scale. Another consideration in choosing the weights is to make them reflect
the relative importance of the interactions. For example, if images are clustered
based on their captions and their color histograms, the images/captions interaction
can have a heavier weight than the images/colors interaction.

In some cases, the weights $\beta_{ii'}$ can be adjusted during the course of an inference
algorithm, in an annealing framework. Also, the weights can be learned using a
model learning procedure. Both of these extensions are left for future work.

7.3  Comraf*:  a  lightweight  version  of  the  Comraf  model 

In  previous  chapters  we  made  it  obvious  that,  in  most  real-world  situations,  a 
practitioner  is  interested  in  clustering  only  one  modality  (images,  in  our  case),  which 
we  call  here  a  target  modality.  This  implies  that  not  every  modality  has  to  be 
clustered:  if  a  representation  of  a  modality  is  dense  enough,  clustering  it  may  cause 
an  underestimation  of  the  joint  (an  effect  known  as  oversmoothing) ,  which  may  hurt 
clustering  results  of  the  target  modality.  For  example,  if  images  are  distributed  over 

Figure 7.1. Comraf* models: (a) for images Gc and words Wc from their captions;
(b) for images, words and colors Cc; (c) for images, words, colors and blobs Bc; (d)
straightforward generalization to any number of modalities.

256  colors,  it  makes  no  sense  to  simultaneously  cluster  images  and  colors  because  the 
distributions  are  already  dense  enough. 

In this section, we propose a special case of Comraf models, in which only the
target modality is clustered, while the representations of all the other modalities are
assumed to be dense enough. Each unclustered modality is associated with an observed
combinatorial random variable. Recall that a combinatorial random variable is
defined over all the possible clusterings of a given set. In the case of unclustered modalities,
the observed value of a corresponding combinatorial random variable is a clustering
of all singleton clusters. For example, given a set {red, green, blue}, the observed
value of a corresponding combinatorial random variable is {{red}, {green}, {blue}}.
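
As a small illustration of this observed value, the following sketch builds the all-singletons clustering of a set. The function name is hypothetical and serves only to make the example concrete.

```python
def singleton_clustering(values):
    """Return the clustering in which every value forms its own cluster."""
    return [frozenset([v]) for v in values]


colors = ["red", "green", "blue"]
observed = singleton_clustering(colors)  # one cluster per color
```

Because every cluster holds exactly one value, the distribution over clusters of an unclustered modality coincides with the distribution over its original values.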

Each observed combinatorial random variable of an unclustered modality is
connected by an edge with a hidden combinatorial random variable of the target modality.
Observed nodes are not connected to each other because they are statistically
independent by definition. Hence, the resulting topology of the Comraf model is an asterisk
with the target modality in the center. We call such a model Comraf*. Examples of
Comraf* graphs are given in Figure 7.1. Even though only one modality is clustered
in Comraf*, it is still a model for multi-modal clustering, as multiple modalities are
involved in the clustering process.

Recall that in Chapter 4 we considered Comraf models for multi-modal clustering
where each modality should be clustered. In Comraf* models only one modality is
clustered. The general Comraf model, however, takes care of any number of dense
and sparse random variables. In Section 7.4.2 we present a Comraf model for
simultaneously clustering images and their local features, while incorporating other
(unclustered) modalities. Since the simultaneous clustering can be computationally
hard, we also show how to reduce the computational burden by translating such
a Comraf model into a series of Comraf* models, each of which is then optimized
separately.


7.3.1 Inference in Comraf*

In Comraf*, where all the edges are attached to $\tilde{X}_0^c$ and all the leaves are observed
combinatorial random variables, Equation (7.1) is transformed into:

$$\tilde{x}_0^{c*} = \arg\max_{\tilde{x}_0^c} \sum_{i=1}^{m-1} \beta_i \, I(\tilde{X}_0; \tilde{X}_i) = \arg\max_{\tilde{x}_0^c} \sum_{i=1}^{m-1} \beta_i \, I(\tilde{X}_0; X_i), \qquad (7.2)$$

since $\tilde{X}_i = X_i$ for the unclustered modalities. As always, we have the $|\tilde{X}_0| = k$
constraint.

To compute the weighted sum of pairwise mutual information from Equation (7.2),
the following procedure is used. The input of the procedure is an (empirical) joint
distribution $P(X_0, X_i)$ of the underlying data of each interacting pair $(X_0, X_i)$. For
a given partitioning $\tilde{X}_0$, the distribution $P(\tilde{X}_0, X_i)$ is computed using the cumulative
rule $P(\tilde{x}_0, x_i) = \sum_{x_0 \in \tilde{x}_0} P(x_0, x_i)$. Marginals $P(\tilde{X}_0)$ and $P(X_i)$ are obtained through
the marginalization $P(\tilde{x}_0) = \sum_{x_i} P(\tilde{x}_0, x_i)$ and $P(x_i) = \sum_{\tilde{x}_0} P(\tilde{x}_0, x_i)$. Now we have
all the ingredients to calculate the mutual information:

$$I(\tilde{X}_0; X_i) = \sum_{\tilde{x}_0, x_i} P(\tilde{x}_0, x_i) \log \frac{P(\tilde{x}_0, x_i)}{P(\tilde{x}_0) \, P(x_i)}.$$
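
The procedure above can be sketched in a few lines of code. This is a minimal illustration rather than the implementation used in our tool: the function names, the nested-list representation of the joint tables, the use of the natural logarithm and the toy data are all assumptions made for the example.

```python
import math


def aggregate_joint(joint, clusters):
    """Cumulative rule: sum the rows of P(x0, xi) that fall in each cluster.

    joint    -- list of rows, joint[x0][xi] = P(x0, xi)
    clusters -- list of lists of row indices (a partition of the rows)
    """
    n_cols = len(joint[0])
    return [[sum(joint[x0][xi] for x0 in cluster) for xi in range(n_cols)]
            for cluster in clusters]


def mutual_information(table):
    """Mutual information of a joint probability table, in nats."""
    row_marg = [sum(row) for row in table]        # marginal over clusters
    col_marg = [sum(col) for col in zip(*table)]  # marginal over x_i
    mi = 0.0
    for r, row in enumerate(table):
        for c, p in enumerate(row):
            if p > 0.0:  # terms with zero probability contribute nothing
                mi += p * math.log(p / (row_marg[r] * col_marg[c]))
    return mi


def weighted_objective(joints, weights, clusters):
    """Weighted sum of pairwise mutual information terms, as in Equation (7.2)."""
    return sum(beta * mutual_information(aggregate_joint(joint, clusters))
               for joint, beta in zip(joints, weights))


# Hypothetical toy data: four images (rows) and two words (columns).
joint = [[0.25, 0.0], [0.25, 0.0], [0.0, 0.25], [0.0, 0.25]]
informative = [[0, 1], [2, 3]]    # groups images that share a word
uninformative = [[0, 2], [1, 3]]  # mixes them
score_good = weighted_objective([joint], [1.0], informative)
score_bad = weighted_objective([joint], [1.0], uninformative)
```

In this toy case the first partition preserves all the information about the words, while the second one preserves none, so an optimizer of Equation (7.2) would prefer the first.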

To perform inference in Comraf*, we apply a version of our MDC algorithm
(see Section 4.3), where either a top-down or a bottom-up clustering procedure is used
for clustering $X_0$. In the top-down procedure, we start with one cluster that contains
all the values of $X_0$ and split it until the required number of clusters is obtained (while
interleaving with the optimization routine). In bottom-up clustering, we start with
all singleton clusters and merge them until, again, reaching the required number of
clusters.
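
The bottom-up schedule can be illustrated with the greedy skeleton below. It is only a sketch of the merging loop, not the MDC algorithm itself: it omits the interleaved optimization routine, it evaluates every candidate merge exhaustively (which is more expensive than the procedure analyzed in this section), and the objective is passed in as an arbitrary callable. All names and the toy objective are hypothetical.

```python
from itertools import combinations


def bottom_up(n_items, k, objective):
    """Greedily merge singleton clusters until k clusters remain.

    objective -- callable taking a list of clusters (lists of item indices)
                 and returning a score to maximize.
    """
    clusters = [[i] for i in range(n_items)]
    while len(clusters) > k:
        best_score, best_pair = None, None
        for a, b in combinations(range(len(clusters)), 2):
            candidate = [c for j, c in enumerate(clusters) if j not in (a, b)]
            candidate.append(clusters[a] + clusters[b])
            score = objective(candidate)
            if best_score is None or score > best_score:
                best_score, best_pair = score, (a, b)
        a, b = best_pair
        merged = clusters[a] + clusters[b]
        clusters = [c for j, c in enumerate(clusters) if j not in (a, b)]
        clusters.append(merged)
    return clusters


# Toy objective (an assumption for the example): penalize clusters that mix labels.
labels = [0, 0, 1, 1]


def purity(clusters):
    return -sum(1 for c in clusters if len({labels[i] for i in c}) > 1)


result = bottom_up(len(labels), 2, purity)
```

In a Comraf* setting the callable would be the weighted mutual information objective of Equation (7.2), evaluated on the candidate clustering of the target modality.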

The computational complexity of the top-down algorithm is $O(l \, |X_0| \sum_{i=1}^{m-1} |X_i|)$,
and of the bottom-up algorithm $O(l \, |X_0|^2 \sum_{i=1}^{m-1} |X_i|)$, where $l$ is a (fixed) number of
clustering iterations. Note that an arbitrary number of leaves (unclustered modalities)
can be incorporated into the Comraf* model, while adding new modalities increases
the complexity only linearly.

7.4  Modalities  of  an  image  collection 

In this work, along with images, we consider three other modalities. The first
one is words from image captions. We remove stopwords and apply a simple
‘s’-stemming (removal of plural suffixes). A joint probability of an image $g$ and a word
$w$ is $P(g, w) = N_{w \in g} / |W|$, where $N_{w \in g}$ is the number of occurrences of $w$ in $g$’s caption and $|W|$
is the total number of words. Another modality is colors appearing in images. The
joint probability distribution of colors and images is obtained from color histograms,
as the number of pixels of color $c$ in image $g$ divided by the total number of pixels in
all images. The third modality is blobs, as described below.
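
A minimal sketch of the image/word estimate is given below. It assumes that captions are already tokenized, stopped and stemmed, and that the total number of words is counted over word occurrences in all captions, so that the table sums to one; the function name and the toy captions are hypothetical.

```python
from collections import Counter


def joint_image_word(captions):
    """Estimate P(g, w) as the count of w in g's caption over the total word count.

    captions -- dict mapping an image id to its list of caption tokens
    Returns a dict {(image_id, word): probability}.
    """
    counts = Counter((g, w) for g, tokens in captions.items() for w in tokens)
    total = sum(counts.values())  # total number of word occurrences (assumption)
    return {pair: n / total for pair, n in counts.items()}


# Hypothetical toy captions, for illustration only.
captions = {"img1": ["sea", "sun", "sea"], "img2": ["sun"]}
p = joint_image_word(captions)
```

The image/color table can be built in the same way, with pixel counts per color taking the place of word counts.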

7.4.1  Rectangular  blobs 

Blobs  (or  visual  terms)  are  a  special  type  of  image  content  representation  based  on 
a  fixed  vocabulary.  To  generate  blobs,  images  are  first  segmented  into  regions,  which 
are  then  clustered  across  all  images.  Blobs  are  the  resulting  region  clusters.  Each 
image is mapped onto the set of blobs, which leads to a representation analogous
to the bag-of-words (BOW) in text processing.

Barnard and Forsyth [6] and Duygulu et al. [37] segment images into semantically
coherent regions using Blobworld and Normalized-Cuts algorithms, respectively.
Unfortunately, these algorithms do not always produce segmentations accurate enough
for further use. Jeon and Manmatha [56] and Feng et al. [41] use a rectangular grid
to segment images and report better results on an image retrieval task. We apply
the same set of blobs as in [41], built using the following procedure. Images are first
segmented into regions using a 6 x 4 grid. Then, for each region, a feature vector is
constructed that contains texture and color information: Gabor texture filters with
4 orientations and 3 scales are used to construct 12-dimensional texture features; the
mean, standard deviation and skewness of RGB and LAB components are computed
to build 18-dimensional color features. The resulting 30-dimensional feature vectors
are clustered using k-means.
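
The grid segmentation step can be sketched as follows. This is an illustration under stated assumptions: we read "6 x 4" as 6 columns by 4 rows, boxes are given as (left, top, right, bottom) pixel coordinates, and the function name and image size are hypothetical. Feature extraction (Gabor filters, color moments) and k-means are not shown.

```python
def grid_regions(width, height, cols=6, rows=4):
    """Return (left, top, right, bottom) boxes of a cols x rows grid.

    Boundaries are rounded down, so the boxes tile the image without gaps
    even when its size is not divisible by the grid dimensions.
    """
    boxes = []
    for r in range(rows):
        for c in range(cols):
            left, right = c * width // cols, (c + 1) * width // cols
            top, bottom = r * height // rows, (r + 1) * height // rows
            boxes.append((left, top, right, bottom))
    return boxes


# Hypothetical image size, for illustration only.
boxes = grid_regions(384, 256)
```

The feature dimensions quoted above follow from simple counting: 4 orientations times 3 scales give 12 texture features, and 3 statistics over the 6 RGB and LAB channels give 18 color features, for 30 in total.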

7.4.2  Blobs  constructed  by  Comraf  models 

As  discussed  in  Section  7.4.1  above,  a  clustering  process  is  involved  in  constructing 
blobs  from  rectangular  regions,  represented  by  color  and  texture  features.  Naturally, 
since  Comrafs  are  models  for  multi-modal  clustering,  an  intrinsic  Comraf  model  can  be 
used for simultaneously clustering images and their regions. Co-clustering of images
and features has been recently described in the literature [89]; however, Comrafs have
additional power over co-clustering methods: Comrafs can incorporate multiple
modalities, both sparse (that are to be clustered) and dense (that are not).

Figure  7.2  (left)  shows  a  Comraf  model  for  clustering  images  G  simultaneously 
with  their  regions  R,  taking  into  account  color  C  and  texture  T  information  of  the 
regions,  as  well  as  the  colors  and  caption  words  W  of  the  images.  Obviously,  more 
edges  and  nodes  can  be  added  to  the  model,  depending  on  the  data’s  availability. 

Figure 7.2. (left) A Comraf model for simultaneously clustering images Gc and their
rectangular  regions  Rc,  while  taking  into  account  words  Wc  from  image  captions, 
colors  Cc  and  texture  data  Tc;  (right)  a  translation  of  this  model  into  a  two-step 
Comraf*:  the  first  Comraf*  is  for  clustering  regions  into  blobs,  whereas  the  second 
Comraf*  is  for  clustering  images  based  on  these  blobs. 

In Section 7.3.1 we mentioned that the input of a Comraf inference procedure is
a set of pairwise probability tables $P(X_i, X_{i'})$ for each edge in the Comraf graph. An
interesting case is the (Gc, Rc) edge between image and region combinatorial random
variables in Figure 7.2 (left). Unlike colors and caption words, each region is unique,
so for each region $r$ and each image $g$, their joint probability is $P(r, g) = 1/|R|$ if $r \in g$,
and 0 otherwise (where $|R|$ is the total number of regions in the dataset). Such a
probability mass function is useless for clustering regions, because only regions that
belong to the same image can be clustered together. A possible way to resolve this
problem would be to estimate this probability by giving a portion of its mass to
$P(r, g)$ even if $r \notin g$. Such an estimation can be made based on computing an affinity
metric between regions of various images, which is computationally hard: $O(|R|^2 |r|)$,
where $|r|$ is the size of any region.

Comrafs offer an elegant solution to this problem: since regions are clustered not
only based on images, but also based on colors and texture, neither of which has
this problem, we still can use our objective function from Equation (7.1). As long as
images are clustered in parallel with regions, Equation (7.1) allows grouping together
regions that belong to the same image cluster, as desired. Therefore, we apply the
Comraf model from Figure 7.2 (left) as it is. We choose to cluster images bottom-up
and regions top-down. In our objective (7.1), we cope with the fact that $I(R; G)$ is
two orders of magnitude larger than $I(R; C)$ and $I(R; T)$, by setting the weights of
the latter two terms to 100.

The simultaneous clustering of images and regions is a time-consuming process:
its complexity is $O(|R| \, |G| \, (|C| + |T| + |W|))$. We propose a lightweight version of
this model, in which inference is done in two steps: first, regions are clustered based
on their color and texture features, and then images are clustered based on colors,
caption words and region clusters. Such a model is equivalent to two Comraf* models
applied one after another, as presented in Figure 7.2 (right). This model’s complexity
is plausible: $O(|R| \, (|C| + |T|) + |G| \, (|C| + |\tilde{R}| + |W|))$, where $|\tilde{R}|$ is the number of region
clusters. Moreover, in Section 7.5.2 we show that the performance of the two-step
Comraf* is plausible as well: on one of our datasets, it obtains clustering accuracy
comparable to the one of a general Comraf. Generalizing the two-step setting, it is
easy to see that any Comraf can be translated into a series of Comraf* models.
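
To give a feeling for the gap between the two complexity expressions, the sketch below evaluates both on made-up modality sizes. The sizes are hypothetical and chosen only for illustration, constant factors are ignored, and the function names are not part of any tool.

```python
def general_cost(R, G, C, T, W):
    """Operation count suggested by the general Comraf complexity expression."""
    return R * G * (C + T + W)


def two_step_cost(R, G, C, T, W, R_clusters):
    """Operation count suggested by the two-step Comraf* complexity expression."""
    return R * (C + T) + G * (C + R_clusters + W)


# Hypothetical sizes: 1,000 images with 24 regions each, 256 colors,
# 100 texture values, 2,000 caption words and 500 region clusters.
G, R, C, T, W, R_clusters = 1000, 24000, 256, 100, 2000, 500
general = general_cost(R, G, C, T, W)
two_step = two_step_cost(R, G, C, T, W, R_clusters)
ratio = general / two_step
```

Under these assumed sizes the two-step model needs several thousand times fewer operations, because the product of the number of regions and the number of images no longer appears.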

7.5  Experimentation 

We experiment with a variety of particular Comraf* models (see examples in
Figure 7.1), as well as with the general Comraf models from Figure 7.2. The experiments
are conducted using our open-source Comraf clustering tool.2 In all our models,
images are clustered agglomeratively. All our results are averaged over 10 independent
runs, with the standard error reported. As a baseline, we use the k-means algorithm
(SimpleKMeans implementation of WEKA3), where images are represented as BOW
of their captions. Also, our 2-node Comraf* model is equivalent to the hard-clustering
version of Information Bottleneck (IB) [106] (see Section 4.1 for discussion), hence

2 http://sourceforge.net/projects/comraf

3 http://cs.waikato.ac.nz/ml/weka

Category     # of images     Category         # of images
Birds        152             Christianity     191
Desert       172             Islam            96
Flowers      165             Judaism          187
Trees        190             Personalities    188
Food         187             Symbols          130
Housing      165             OVERALL:         1823

Table 7.1. Categories (and their sizes) of the IsraelImages dataset.


we  use  it  as  our  baseline  as  well.  For  evaluation  of  our  clustering  results,  we  use 
micro-averaged  accuracy  (Section  4.6.1)  of  the  constructed  image  clustering. 
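
The evaluation measure can be sketched in a few lines. The definition itself is given in Section 4.6.1; the sketch below assumes the common reading in which every cluster is credited with its dominant category, and the function name `micro_averaged_accuracy` is hypothetical (it is not part of the Comraf tool).

```python
from collections import Counter, defaultdict


def micro_averaged_accuracy(cluster_ids, true_labels):
    """Fraction of items whose category is the dominant category of their cluster.

    Assumption of this sketch: a cluster is 'labeled' by its most frequent
    ground-truth category, and accuracy is pooled over all items.
    """
    if not true_labels or len(cluster_ids) != len(true_labels):
        raise ValueError("inputs must be non-empty and of equal length")
    labels_per_cluster = defaultdict(Counter)
    for cluster, label in zip(cluster_ids, true_labels):
        labels_per_cluster[cluster][label] += 1
    # Each cluster contributes the size of its largest single-category group.
    correct = sum(max(counts.values()) for counts in labels_per_cluster.values())
    return correct / len(true_labels)
```

For example, two clusters of three images each, with two images of the dominant category in each cluster, give an accuracy of 4/6.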

7.5.1  Datasets 

We  demonstrate  the  performance  of  our  clustering  methods  on  two  datasets:  a 
subset  of  the  benchmark  Corel  dataset  and  a  new  multimedia  dataset,  which  we  refer 
to as IsraelImages, collected by us especially for this work.

The  Corel  subset4  has  already  been  used  in  various  previous  research  projects 
[37,  55,  41].  The  dataset  consists  of  5,000  images  from  50  Corel  Stock  Photo  CDs. 
Each  CD  contains  100  images  on  the  same  topic,  such  as  “Sunrises  and  Sunsets”, 
“Mountains  of  America”  and  “Wild  Animals”.  Every  image  has  a  caption  and  an 
annotation.  The  caption  is  a  brief  description  of  the  scene  and  the  annotation  is  a  list 
of  objects  that  appear  in  the  image.  An  example  of  an  image  caption  is  “Man  And  Boy 
Fishing  Mountain  Lake”,  while  “Tree  People  Mountain  Water”  is  an  annotation  for 
this  image.  Overall  371  words  are  used  to  annotate  the  collection.  The  original  dataset 
has  4,500  training  images  and  500  test  images.  Since  our  model  does  not  require 
training,  we  use  4,500  training  images  for  our  experiments  and  save  the  remaining 
500  images  for  future  use. 


4 http://kobus.ca/research/data/eccv_2002
Method                                   Accuracy
k-means: images over caption words       22.0%
IB: images/caption words                 44.2 ± 1.0%
IB: images/colors                        24.4 ± 0.2%
Comraf*: images/words/colors             54.2 ± 0.9%
General Comraf: Figure 7.2 (left)        68.6 ± 1.0%
Two-step Comraf*: Figure 7.2 (right)     69.0 ± 0.6%

Table 7.2. Micro-averaged clustering accuracy on IsraelImages. All
IB/Comraf results are averaged over 10 independent runs with the standard error
of the mean reported after the ‘±’ sign.
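
The ‘±’ figures are standard errors of the mean over independent runs. A minimal sketch of that computation follows; the helper name `mean_and_standard_error` is hypothetical and any per-run accuracies passed to it are the caller's own data, not values from this chapter.

```python
import math
import statistics


def mean_and_standard_error(values):
    """Return (mean, standard error of the mean) for a list of per-run scores.

    The standard error is the sample standard deviation (n - 1 in the
    denominator) divided by the square root of the number of runs.
    """
    if len(values) < 2:
        raise ValueError("at least two runs are needed")
    mean = statistics.mean(values)
    sem = statistics.stdev(values) / math.sqrt(len(values))
    return mean, sem
```

With 10 runs, as in the tables here, the standard error is the sample standard deviation of the 10 accuracies divided by the square root of 10.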


The second dataset consists of 1823 images downloaded from IsraelImages.com.
The images reflect different aspects of Israeli scenery and/or society and are grouped
into 11 categories (see Table 7.1). Each image is 375 by 250 pixels and has a caption
of 1 to 18 words. This dataset is available to the research community.5

7.5.2  Comparative  results 

Our results on the IsraelImages dataset are reported in Table 7.2. Adding the
color modality to the caption BOW improves the clustering result by 10% (on an
absolute scale), whereas adding the regions (in a 2-step Comraf* scheme) leads to an
additional 15% improvement. These findings demonstrate the value of the multi-modal
setting in image clustering. The general Comraf model from Figure 7.2 (left) is not
able  to  outperform  the  2-step  Comraf*.  This  is  probably  due  to  the  fact  that  color  and 
texture  information  is  more  important  for  clustering  regions  than  the  correspondence 
between  regions  and  image  clusters. 

We  also  experiment  with  various  levels  of  color  granularity  in  a  3-node  Comraf* 
setting  (from  Figure  7.1b) — the  results  are  presented  in  Figure  7.3  (left).  As  can  be 
seen,  if  the  color  information  is  detailed  enough  (above  216  colors),  the  difference  in 
the  results  is  statistically  insignificant.  Figure  7.3  (center)  shows  the  results  of  the 

5 http://www.cs.umass.edu/~ronb/image_clustering.html
Method                                   Accuracy
k-means: images over caption words       22.0%
IB: images/caption words                 46.6 ± 0.5%
IB: images/colors                        22.5 ± 0.2%
IB: images/blobs (see Section 7.4.1)     24.7 ± 0.3%
Comraf*: images/words/colors             55.3 ± 0.5%
Comraf*: images/words/blobs              59.4 ± 0.5%
Comraf*: images/words/colors/blobs       60.1 ± 0.3%
Two-step Comraf*: Figure 7.2 (right)     61.2 ± 0.4%

IB: images/annotation words              58.6 ± 0.3%

Table 7.3. Micro-averaged clustering accuracy on Corel. All IB/Comraf
results are averaged over 10 independent runs with the standard error of the mean
reported after the ‘±’ sign.


2-step Comraf* over various numbers of colors for clustering regions. Generally, fewer
colors are needed for clustering regions than for clustering images: 216 colors appear
to be too many.

A summary of our results on the Corel dataset is presented in Table 7.3. It
shows trends surprisingly similar to those for IsraelImages. On a 3-node setup with caption
words and blobs we obtain 59.4% accuracy, which is especially impressive given that a
random assignment of images into 50 clusters would lead to 2% accuracy (our result
is about 30 times above random). Adding the color modality improves this result only
insignificantly (as expected, since blobs already incorporate the color information,
along with texture). The success of 3-node and 4-node Comraf* clustering models
is also supported by the fact that they outperform a 2-node supervised clustering
model, in which images are clustered with respect to their annotations assigned by
human experts.

The 2-step Comraf* shows some further (minor) improvement over the 1-step
Comraf* models. Here, in contrast to IsraelImages, 8 colors are enough for clustering
regions, and adding more colors causes a significant drop in the performance. We
suspect that the Corel dataset is “too simple”: it contains many images that are
Figure 7.3. Experimentation with various numbers of: (left) colors on IsraelImages
in a 3-node images/words/colors Comraf*; (center) colors for clustering regions
in the 2-step Comraf* on IsraelImages; (right) blobs on Corel in a 3-node
images/words/blobs Comraf*. Our baseline is the 2-node images/words clustering
result. Left and right graphs show the same trend: after reaching a certain number of
colors (256) or blobs (2000), the results vary only insignificantly. The central graph,
however, shows that too many colors for clustering regions can hurt.


almost  identical  to  each  other,  therefore  more  advanced  clustering  models  lead  to  no 
(or  minor)  gain  over  the  simpler  ones. 

Analogously  to  our  Israellmages  experiment  with  various  sizes  of  color  sets,  we 
test  various  numbers  of  blobs  on  Corel.  In  previous  work  [37,  55],  the  number  of 
blobs  is  set  to  500,  to  (roughly)  correspond  to  the  number  of  annotation  keywords. 
Here  we  show  that  500  blobs  are  not  enough  for  clustering:  when  moving  from  1000 
to  2000  blobs,  a  significant  boost  in  the  system’s  performance  can  be  seen. 

Figures 7.4 and 7.5 illustrate the quality of the multi-modal setup: unrelated
groups  of  images  are  mixed  together  when  the  clustering  is  based  only  on  caption 
words,  whereas  they  are  nicely  separated  when  a  visual  modality  is  added. 


7.6  Summary 

In this chapter, we have introduced the Comraf framework for clustering multimedia
collections. We have also proposed a family of lightweight Comraf models called
Comraf*, which demonstrate excellent performance on clustering two real-world image
collections. To further improve the image clustering results, a semi-supervised
Comraf setting (see Chapter 5) can be used, in which a few labeled examples are
taken  into  account  in  the  clustering  process.  We  plan  to  experiment  with  this  setting 
in  our  future  work. 

Designing  general  Comraf  models  for  image  clustering  (in  the  flavor  of  the  model 
shown  in  Figure  7.2  left)  is  an  ongoing  process.  Extensive  experimentation  will  lead 
to  discovering  the  optimal  Comraf  setting  for  clustering  multimedia  collections. 
(a) Clustering results using only caption words, Corel dataset


(b)  Clustering  results  using  words  and  blobs,  Corel  dataset 


Figure  7.4.  Corel  dataset.  The  first  row  shows  clustering  results  using  only  words. 
Swimmers  and  swimming  tigers  are  clustered  together  because  they  share  common 
terms like “water” and “swim”. The second and the third rows show clustering results
using  both  words  and  blobs.  The  swimmers  and  the  swimming  tigers  are  now  in  two 
different  clusters  with  other  similar  images. 


(a) Clustering results using only caption words, IsraelImages dataset


(b) Clustering results using words and color histograms, IsraelImages dataset


Figure 7.5. IsraelImages dataset. Portraits of people and pictures of the menorah monument
are clustered together using caption words because they have the word ‘Knesset’
(the Israeli parliament) in common: the individuals are Knesset members, while the
menorah monument is placed in front of the Knesset building. The problem is resolved
after the color modality is added.
CHAPTER 8


CONCLUSION 


In  this  thesis  we  have  introduced  Combinatorial  Markov  Random  Fields  (Comrafs) — 
a novel, generic framework for statistical modeling that consists of three basic
components:

1.  An  undirected  graph  with  nodes  being  statistical  objects  of  “rich”  structure  and 
edges  being  interactions  between  those  objects; 

2.  An  objective  function  (either  probabilistic  or  non-probabilistic)  factored  over 
the  graph; 

3.  A  method  for  optimizing  the  objective. 

We  have  applied  the  Comraf  framework  to  multi-modal  learning,  which  is  a  learning 
problem in the environment where multiple views (or modalities) of the data are
available.  Based  on  the  material  presented  in  previous  chapters,  we  can  give  an 
ultimate  recipe  for  solving  multi-modal  learning  problems  with  Comrafs: 

1.  Come  up  with  a  few  modalities  for  a  particular  dataset.  In  most  cases,  it  is 
easy  to  come  up  with  two  modalities:  one  for  data  instances,  another  for  their 
features.  Once  a  few  data  types  or  feature  types  are  available,  they  can  be 
represented as modalities. Note that a modality is a set over which a probability
distribution can be defined. Comrafs are unlikely to be useful in cases
where data instances are represented as feature vectors in which each feature is
intrinsically different from the others (e.g. where feature vectors consist of four
features: color, size, temperature, and price, such that it is difficult to define a
probability distribution over this feature set).

2.  Decide  which  modalities  interact  with  each  other.  This  decision  should  be  made 
upon  availability  of  contingency  information  for  each  pair  of  modalities,  as  well 
as  based  on  domain  knowledge  (e.g.  whether  or  not  it  is  natural  to  see  these 
modalities interacting). For example, say we are given three modalities: images,
their colors, and their caption words. Captions definitely interact with images,
as do colors. However, we can assume that captions do
not interact with colors, as captions only rarely describe
colors in an image directly. Note that a decision about the presence or absence of interactions
is  analogous  to  defining  conditional  independencies  in  other  types  of  graphical 
models:  the  number  of  interactions  should  be  kept  as  low  as  possible  in  order 
to  keep  the  model  tractable. 

3.  For  a  particular  learning  task,  decide  which  modalities  should  be  optimized  and 
which  should  be  observed.  Observed  modalities  usually  provide  some  level  of 
supervision  to  the  model:  using  observed  modalities,  prior  knowledge  can  be 
represented.  Also,  a  modality  can  be  observed  if  its  size  is  very  small,  such  that 
a  distribution  defined  over  this  modality  is  statistically  dense. 

4.  Represent  those  modalities  that  are  to  be  optimized  as  hidden  combinatorial 
random variables, and those that are not as observed combinatorial random variables.
A combinatorial r.v. can be defined over a set of possible partitionings,
subsets,  partial  orderings  of  a  modality,  or  over  other  types  of  combinatorial 
sets,  according  to  the  particular  problem  being  solved. 

5.  Represent  each  combinatorial  r.v.  as  a  node  in  a  graph  in  which  undirected 
edges  correspond  to  interactions.  We  now  have  finished  building  the  Comraf 
graph. 
6. Represent the learning task as optimization of an objective function that is defined
over nodes and edges in the Comraf graph. Choose the objective function
that  best  suits  the  task.  The  choice  of  objective  function  can  be  made  based  on 
previously  published  work  in  the  field  (as  in  Chapter  4),  or  based  on  theoretical 
analysis  (as  in  Chapter  6),  as  well  as  based  on  some  pilot  research  or  other 
considerations. 

7.  Since  optimizing  the  objective  function  simultaneously  over  the  entire  Comraf 
graph  appears  to  be  intractable,  propose  a  method  for  traversing  the  Comraf 
graph  in  order  to  perform  iterative  optimization  of  the  objective.  In  this  thesis, 
we  have  discussed  two  such  methods:  Iterative  Conditional  Modes  (ICM)  and 
Clique-wise  Optimization  (CWO) — see  Section  3.3.  ICM  can  be  considered  a 
global optimization method, as the objective is optimized at each node conditionally
on the rest of the model. In contrast, CWO is a local optimization
method,  as  the  objective  is  optimized  over  each  clique  independently  of  the  rest 
of  the  model. 

8.  At  each  node  /  clique,  apply  a  combinatorial  optimization  method  for  optimizing 
the  objective.  In  Section  4.3,  we  proposed  two  simple  and  greedy  combinatorial 
optimization  methods  (sequential  and  shuffled),  both  of  which  explore  the  local 
neighborhood  of  an  initial  configuration.  More  sophisticated  methods,  such  as 
Branch  and  Bound,  can  be  used  as  well. 

9.  As  the  global  optimum  of  the  objective  function  is  unlikely  to  be  found,  propose 
a stopping criterion for the optimization procedure. In our experiments
with  multi-modal  clustering,  we  halted  the  optimization  procedure  as  soon  as 
the  desired  number  of  clusters  was  achieved. 
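
The recipe above can be illustrated with a deliberately small sketch: two interacting modalities (the rows and columns of a co-occurrence count matrix), one edge, an ICM-style alternation between the two nodes, and a sequential greedy optimizer at each node. Everything in it is an assumption made for illustration: the objective is taken to be the mutual information between the two clusterings, the number of clusters is kept fixed, the run stops when no single move improves the objective (rather than when a desired number of clusters is reached), and the names `mutual_information`, `sequential_pass` and `icm_coclustering` are hypothetical. The objective is recomputed from scratch at every move, which is simple but slow.

```python
import math
import random


def mutual_information(counts, row_of, col_of, n_row_clusters, n_col_clusters):
    """I(row clusters; column clusters) induced by a co-occurrence count matrix."""
    joint = [[0.0] * n_col_clusters for _ in range(n_row_clusters)]
    total = 0.0
    for i, row in enumerate(counts):
        for j, c in enumerate(row):
            joint[row_of[i]][col_of[j]] += c
            total += c
    row_marg = [sum(r) / total for r in joint]
    col_marg = [sum(joint[a][b] for a in range(n_row_clusters)) / total
                for b in range(n_col_clusters)]
    mi = 0.0
    for a in range(n_row_clusters):
        for b in range(n_col_clusters):
            p = joint[a][b] / total
            if p > 0:
                mi += p * math.log(p / (row_marg[a] * col_marg[b]))
    return mi


def sequential_pass(assignment, n_clusters, objective):
    """Greedy local search at one node: try every cluster for every element
    and keep the placement with the highest objective value."""
    changed = False
    for idx in range(len(assignment)):
        original = assignment[idx]
        best_cluster, best_value = original, objective()
        for c in range(n_clusters):
            if c == original:
                continue
            assignment[idx] = c
            value = objective()
            if value > best_value + 1e-12:
                best_cluster, best_value = c, value
        assignment[idx] = best_cluster
        changed = changed or best_cluster != original
    return changed


def icm_coclustering(counts, n_row_clusters, n_col_clusters, max_sweeps=20, seed=0):
    """Alternate between the two nodes, optimizing one while the other is fixed."""
    rng = random.Random(seed)
    row_of = [rng.randrange(n_row_clusters) for _ in counts]
    col_of = [rng.randrange(n_col_clusters) for _ in counts[0]]

    def objective():
        return mutual_information(counts, row_of, col_of,
                                  n_row_clusters, n_col_clusters)

    for _ in range(max_sweeps):
        rows_changed = sequential_pass(row_of, n_row_clusters, objective)
        cols_changed = sequential_pass(col_of, n_col_clusters, objective)
        if not (rows_changed or cols_changed):
            break  # local optimum: no single move improves the objective
    return row_of, col_of, objective()
```

Because a move is accepted only when it strictly increases the objective, the returned value is never below that of the random initial configuration; it is a local optimum, not necessarily the global one.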

We applied the proposed framework to multi-modal clustering (Chapter 4), semi-supervised
learning (Chapter 5), and one-class clustering (Chapter 6). Both text and
image domains (Chapter 7) were explored. Three important issues have not been
addressed  in  this  thesis — we  leave  them  for  our  future  work: 

•  Multi-modal  ranking,  which  is  another  application  of  the  Comraf  framework. 
In  the  multi-modal  ranking  problem,  one  simultaneously  ranks  a  number  of 
modalities,  given  rankings  of  the  other  modalities.  One  example  of  multi-modal 
ranking  comes  from  the  data  mining  /  collaborative  filtering  area:  given  a 
ranking of movies, the task is to simultaneously rank the directors of those movies
and the actors who starred in them. The goal of such a system would be to adequately
measure the popularity of celebrities. Another example comes from image retrieval:
given a ranked list of documents, retrieved on a certain query, the task is to
simultaneously rank images in these documents and their local features (blobs
or interest points). Our intuition here is that the simultaneous ranking would
improve the quality of the ranked lists of images. Note that the layout of Comraf graphs
for multi-modal ranking is the same as for multi-modal clustering. However,
the objective function and optimization method should be specific to the multi-modal
ranking task.

•  Scalability issues in Comrafs. Unfortunately, the current version of our MDC
implementation for multi-modal clustering is very slow. Given that each optimization
iteration is repeated ten times (i.e. ten random restarts), a straightforward
enhancement would be to perform those random restarts in parallel on ten
machines.  Another  possible  enhancement  would  be  to  limit  the  search  length 
in  the  shuffled  version  of  MDC,  or  to  parallelize  the  shuffling  steps  using  the 
MapReduce  paradigm  [33]. 

•  Model  learning  in  Comrafs.  It  turns  out  that  the  main  factor  for  achieving 
good  clustering  results  with  Comraf  models  is  the  good  choice  of  modalities  and 
their interactions. It would be desirable to construct a system that could decide a priori
whether the available modalities would be helpful or harmful.
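
The parallel random restarts suggested under the scalability issue can be sketched as follows. This is only an illustration under stated assumptions: `run_restart` is a hypothetical stand-in for one randomly initialized optimization run, the restart with the highest objective value is assumed to be the one kept, and a thread pool is used purely to keep the sketch self-contained (a CPU-bound optimizer would need a process pool or separate machines to gain real speed).

```python
import random
from concurrent.futures import ThreadPoolExecutor


def run_restart(seed):
    """Hypothetical stand-in for one randomly initialized optimization run.

    Returns (objective_value, solution); here a toy objective on random numbers.
    """
    rng = random.Random(seed)
    solution = [rng.random() for _ in range(5)]
    return sum(solution), solution


def best_of_restarts(run, seeds, max_workers=10):
    """Run all restarts concurrently and keep the one with the best objective."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        results = list(pool.map(run, seeds))
    return max(results, key=lambda result: result[0])
```

Since the restarts are independent, no communication between workers is needed; only the final objective values have to be compared.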

In  addition  to  being  a  useful  framework  for  multi-modal  learning,  Comrafs  can  go 
beyond  it:  Comraf  nodes  do  not  necessarily  have  to  represent  data  modalities.  Also, 
random  variables  of  rich  structure  may  not  necessarily  be  of  the  combinatorial  nature, 
so  Comrafs  have  good  potential  for  a  generalization  into  a  new  framework.  We  call  it 
Non-Bayesian Networks (NBN). Development of such a framework is also the subject
of our future work. An interesting question yet to be answered is how to model NBN’s
nodes (which are structurally rich) in a finer-grained manner. A possible answer is
to use a lower-level NBN at each node of the (upper-level) NBN, by which we build
a telescopic model. Constructing such a model would resemble designing an object-oriented
software system, which has a direct connection with the powerful framework of
Object-Oriented Bayesian Networks [61]. Developing Object-Oriented Non-Bayesian
Networks  would  be  the  final  goal  of  this  research. 

To  conclude,  the  contributions  of  this  thesis  are: 

1. Proposing Comrafs, a novel framework for statistical modeling that brings together
two research fields: graphical models and combinatorial optimization.

2.  Applying  this  framework  to  a  variety  of  problems  in  multi-modal  learning,  such 
as  multi-modal  clustering,  semi-supervised  clustering,  interactive  clustering, 
one-class  clustering  etc. 

3.  Proposing  model  layouts,  objective  functions  and  optimization  procedures  for 
each  of  these  problems. 

4.  Showing  empirical  advantage  of  Comrafs  over  previous  state-of-the-art  methods 
on  various  real-world  tasks,  such  as  Web  appearance  disambiguation,  document 
clustering  by  genre  and  author’s  sentiment,  organization  of  image  galleries  etc. 
APPENDIX A


PROOF  OF  THEOREM  6.2.1 


First, note that since both distributions $p_r$ and $p_g$ are uniform, $P(W_{ij} = w \mid Z_{ij} = 1) = \frac{1}{m_r}$ and $P(W_{ij} = w \mid Z_{ij} = 0) = \frac{1}{m}$.

Let us now compute the marginal $P(W_{ij} = w)$. For a relevant word $w_r$, let us
denote $P(W_{ij} = w_r) = p_r$:

$$p_r = P(W_{ij} = w_r) = P(W_{ij} = w_r \mid Z_{ij} = 1)\,P(Z_{ij} = 1) + P(W_{ij} = w_r \mid Z_{ij} = 0)\,P(Z_{ij} = 0) = \frac{1}{m_r}\,\pi + \frac{1}{m}\,(1 - \pi) \qquad (A.1)$$

For a non-relevant word $w_n$, denote $P(W_{ij} = w_n) = p_n$:

$$p_n = P(W_{ij} = w_n) = P(W_{ij} = w_n \mid Z_{ij} = 1)\,P(Z_{ij} = 1) + P(W_{ij} = w_n \mid Z_{ij} = 0)\,P(Z_{ij} = 0) = 0 \cdot \pi + \frac{1}{m}\,(1 - \pi) = \frac{1}{m}\,(1 - \pi)$$

We assume that the difference between these two probabilities is substantial, that is
$p_r - p_n = \pi / m_r \gg 0$. Let $\tau$ be their arithmetic mean:

$$\tau = \frac{1}{2}\,(p_r + p_n). \qquad (A.2)$$


For each word $w$, we introduce a random variable $X_w$ of its count (the number of
its occurrences in the dataset), which is distributed binomially: if $w$ is relevant, then
$X_w \sim \mathrm{Bi}(p_r, N)$ and its mean is $p_r N$; if $w$ is non-relevant, then $X_w \sim \mathrm{Bi}(p_n, N)$ with
mean $p_n N$. We are interested in bounding the probability that $X_w < \tau N$ for relevant
words, and that $X_w > \tau N$ for non-relevant words.

Using the Chernoff bound for a relevant word $w$, we have:

$$P(X_w < \tau N) < \exp\left(-\frac{N\,(p_r - \tau)^2}{2\,p_r}\right) < \epsilon. \qquad (A.3)$$

For a non-relevant word $w$ we have:

$$P(X_w > \tau N) < \exp\left(-\frac{N\,(\tau - p_n)^2}{3\,p_n}\right) < \epsilon. \qquad (A.4)$$

Solving (A.3) and (A.4) simultaneously with respect to $N$, we have:

$$N > \max\left(\frac{2\,p_r \ln\frac{1}{\epsilon}}{(p_r - \tau)^2},\ \frac{3\,p_n \ln\frac{1}{\epsilon}}{(p_n - \tau)^2}\right)$$

and then substituting $\tau$ from (A.2):

$$N > \max\left(\frac{8\,p_r \ln\frac{1}{\epsilon}}{(p_r - p_n)^2},\ \frac{12\,p_n \ln\frac{1}{\epsilon}}{(p_r - p_n)^2}\right) = \frac{8\,p_r \ln\frac{1}{\epsilon}}{(p_r - p_n)^2},$$

where we use the given constraint that $p_w < 2\pi$ (and thus $3 p_n < 2 p_r$, so the first term
is always greater than the second one). Substituting $p_r - p_n = \frac{\pi}{p_w m}$ and applying the
definition of $p_r$ from (A.1), we get:

$$\frac{8\,p_r \ln\frac{1}{\epsilon}}{(p_r - p_n)^2} = 8\,\frac{\pi + p_w - \pi p_w}{p_w m} \cdot \frac{p_w^2 m^2}{\pi^2}\,\ln\frac{1}{\epsilon} \le 8\,\frac{p_w m}{\pi^2}\,\ln\frac{1}{\epsilon} < 16\,\frac{m}{\pi}\,\ln\frac{1}{\epsilon},$$

where we used the fact that $\pi + p_w - \pi p_w \le 1$ and that $p_w < 2\pi$. Finally, we choose
the value of $N$ to be the minimum among all the possible choices:

$$N = 16\,\frac{m}{\pi}\,\ln\frac{1}{\epsilon}.$$

Putting it all together: what is the probability that there exists a word $w$ which was
not detected correctly? Using the union bound we get:

$$P\Big(\bigcup_{w_r} (X_{w_r} < \tau N)\ \cup\ \bigcup_{w_n} (X_{w_n} > \tau N)\Big) < \epsilon m = \delta,$$

so $\epsilon = \frac{\delta}{m}$ and then

$$N = 16\,\frac{m}{\pi}\,\ln\frac{m}{\delta},$$

which is log-linear in $m$.

□
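
As a numeric sanity check of the proof, the sample size required by the two Chernoff conditions can be compared with the closed-form value $16\,\frac{m}{\pi}\ln\frac{m}{\delta}$. The sketch below is illustrative only: the function names are hypothetical, and it assumes that $p_w$ stands for the fraction $m_r/m$ of relevant words, which is what the substitution $p_r - p_n = \frac{\pi}{p_w m}$ requires.

```python
import math


def word_probabilities(pi, m, m_r):
    """Marginal probabilities of a relevant (p_r) and a non-relevant (p_n) word, as in (A.1)."""
    p_r = pi / m_r + (1 - pi) / m
    p_n = (1 - pi) / m
    return p_r, p_n


def chernoff_sample_size(pi, m, m_r, delta):
    """Largest of the two lower bounds on N implied by (A.3) and (A.4), with eps = delta / m."""
    p_r, p_n = word_probabilities(pi, m, m_r)
    tau = 0.5 * (p_r + p_n)                      # threshold from (A.2)
    log_term = math.log(m / delta)               # ln(1 / eps)
    n_relevant = 2 * p_r * log_term / (p_r - tau) ** 2
    n_non_relevant = 3 * p_n * log_term / (p_n - tau) ** 2
    return max(n_relevant, n_non_relevant)


def closed_form_sample_size(pi, m, delta):
    """The final value N = 16 (m / pi) ln(m / delta)."""
    return 16 * m / pi * math.log(m / delta)
```

For instance, with $\pi = 0.3$, $m = 1000$ and $m_r = 100$ (so that $m_r/m = 0.1 < 2\pi$), the closed-form value exceeds the Chernoff requirement, as the proof states.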
APPENDIX B


DETAILS  OF  EM  ALGORITHM  FOR  ONE-CLASS 

CLUSTERING 


Given the graphical model from Figure 6.1, the joint distribution is:

$$P(\{Y\}, \{Z\}, \{w\}) = \prod_i P(Y_i) \prod_j \big[ P(Z_{ij} \mid Y_i)\, P(w_{ij} \mid Z_{ij}) \big] \qquad (B.1)$$

Note that we represent a document $d_i$ as its Bag-Of-Words: $d_i = \{w_{i1}, w_{i2}, \ldots, w_{i|d_i|}\}$.
Let us now define EM parameters $\Theta$:

$$P(Y_i = 1) = p_d \qquad (B.2)$$

For each document $d_i|_{i=1}^{n}$:

$$P(Z_{ij} = 1 \mid Y_i = 1) = \pi_i \qquad (B.3)$$

$$P(Z_{ij} = 1 \mid Y_i = 0) = 0 \qquad (B.4)$$

For each word $w_l|_{l=1}^{m}$:

$$P(w_l \mid Z_l = 1) = p_r(w_l) \qquad (B.5)$$

$$P(w_l \mid Z_l = 0) = p_g(w_l) \qquad (B.6)$$

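Using this notation, the marginal distribution of a document is written as:

$$P(d_i) = \sum_{Y_i} P(Y_i) \sum_{Z_{i1}} P(Z_{i1} \mid Y_i)\, P(w_{i1} \mid Z_{i1}) \sum_{Z_{i2}} P(Z_{i2} \mid Y_i)\, P(w_{i2} \mid Z_{i2}) \cdots \sum_{Z_{i|d_i|}} P(Z_{i|d_i|} \mid Y_i)\, P(w_{i|d_i|} \mid Z_{i|d_i|})$$

$$= \sum_{Y_i} P(Y_i) \prod_{j=1}^{|d_i|} \big( P(Z_{ij} = 1 \mid Y_i)\, P(w_{ij} \mid Z_{ij} = 1) + P(Z_{ij} = 0 \mid Y_i)\, P(w_{ij} \mid Z_{ij} = 0) \big)$$

$$= p_d \prod_{j=1}^{|d_i|} \big( \pi_i\, p_r(w_{ij}) + (1 - \pi_i)\, p_g(w_{ij}) \big) + (1 - p_d) \prod_{j=1}^{|d_i|} p_g(w_{ij}) \qquad (B.7)$$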


E-step 

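Given the current set of parameters $\Theta^k$ at iteration $k$, for each document $d_i$ and
each word $w_{ij}$ in $d_i$, we compute the posteriors:

$$P^k(Y_i = 1 \mid d_i, \Theta^k) = \frac{P(d_i \mid Y_i = 1, \Theta^k)\, P(Y_i = 1 \mid \Theta^k)}{P(d_i \mid \Theta^k)} = \underbrace{\frac{p_d \prod_{j=1}^{|d_i|} \big( \pi_i^k\, p_r^k(w_{ij}) + (1 - \pi_i^k)\, p_g^k(w_{ij}) \big)}{p_d \prod_{j=1}^{|d_i|} \big( \pi_i^k\, p_r^k(w_{ij}) + (1 - \pi_i^k)\, p_g^k(w_{ij}) \big) + (1 - p_d) \prod_{j=1}^{|d_i|} p_g^k(w_{ij})}}_{\text{denote } \alpha_i^k}$$

$$P^k(Y_i = 1, Z_{ij} = 1 \mid d_i, \Theta^k) = P^k(Y_i = 1 \mid d_i, \Theta^k)\, P(Z_{ij} = 1 \mid Y_i = 1, d_i, \Theta^k)$$

$$= P^k(Y_i = 1 \mid d_i, \Theta^k) \times \frac{P(w_{ij}, Z_{ij} = 1 \mid Y_i = 1, \Theta^k)}{P(w_{ij}, Z_{ij} = 1 \mid Y_i = 1, \Theta^k) + P(w_{ij}, Z_{ij} = 0 \mid Y_i = 1, \Theta^k)}$$

$$= \alpha_i^k \times \underbrace{\frac{\pi_i^k\, p_r^k(w_{ij})}{\pi_i^k\, p_r^k(w_{ij}) + (1 - \pi_i^k)\, p_g^k(w_{ij})}}_{\text{denote } \beta_{ij}^k} = \alpha_i^k\, \beta_{ij}^k$$

$$P^k(Y_i = 1, Z_{ij} = 0 \mid d_i, \Theta^k) = P^k(Y_i = 1 \mid d_i, \Theta^k) \times \frac{P(w_{ij}, Z_{ij} = 0 \mid Y_i = 1, \Theta^k)}{P(w_{ij}, Z_{ij} = 1 \mid Y_i = 1, \Theta^k) + P(w_{ij}, Z_{ij} = 0 \mid Y_i = 1, \Theta^k)}$$

$$= \alpha_i^k \times \underbrace{\frac{(1 - \pi_i^k)\, p_g^k(w_{ij})}{\pi_i^k\, p_r^k(w_{ij}) + (1 - \pi_i^k)\, p_g^k(w_{ij})}}_{\text{this term is } (1 - \beta_{ij}^k)} = \alpha_i^k\, (1 - \beta_{ij}^k)$$

$$P^k(Z_{ij} = 1 \mid d_i, \Theta^k) = P^k(Y_i = 1, Z_{ij} = 1 \mid d_i, \Theta^k) = \alpha_i^k\, \beta_{ij}^k$$

$$P^k(Z_{ij} = 0 \mid d_i, \Theta^k) = 1 - P^k(Z_{ij} = 1 \mid d_i, \Theta^k) = 1 - \alpha_i^k\, \beta_{ij}^k$$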


M-step 

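We maximize:

$$Q(\Theta^{k+1} \mid \Theta^k) = \sum_i E\big[ \log P(Y_i, \{Z_{ij}\}, \{w_{ij}\} \mid \Theta^{k+1}) \,\big|\, P^k \big]$$

$$= \sum_i E\Big[ \log \Big( P(Y_i \mid \Theta^{k+1}) \prod_j P(Z_{ij} \mid Y_i, \Theta^{k+1}) \prod_j P(w_{ij} \mid Z_{ij}, \Theta^{k+1}) \Big) \,\Big|\, P^k \Big]$$

$$= \underbrace{\sum_i E\big[ \log P(Y_i \mid \Theta^{k+1}) \,\big|\, P^k \big]}_{\text{denote } A} + \underbrace{\sum_{i,j} E\big[ \log P(Z_{ij} \mid Y_i, \Theta^{k+1}) \,\big|\, P^k \big]}_{\text{denote } B} + \underbrace{\sum_{i,j} E\big[ \log P(w_{ij} \mid Z_{ij}, \Theta^{k+1}) \,\big|\, P^k \big]}_{\text{denote } C}$$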

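The $A$ portion should not be optimized, because $p_d$ is a constant in our setting.

$$B = \sum_{i,j} E\big[ \log P(Z_{ij} \mid Y_i, \Theta^{k+1}) \,\big|\, P^k \big]$$

$$= \sum_{i,j} P^k(Y_i = 1, Z_{ij} = 1 \mid d_i)\, \log P(Z_{ij} = 1 \mid Y_i = 1, \Theta^{k+1})$$

$$+ \sum_{i,j} P^k(Y_i = 1, Z_{ij} = 0 \mid d_i)\, \log P(Z_{ij} = 0 \mid Y_i = 1, \Theta^{k+1})$$

$$+ \sum_{i,j} P^k(Y_i = 0, Z_{ij} = 1 \mid d_i)\, \log P(Z_{ij} = 1 \mid Y_i = 0, \Theta^{k+1})$$

$$+ \sum_{i,j} P^k(Y_i = 0, Z_{ij} = 0 \mid d_i)\, \log P(Z_{ij} = 0 \mid Y_i = 0, \Theta^{k+1})$$

$$= \sum_{i,j} \alpha_i^k\, \beta_{ij}^k\, \log \pi_i^{k+1} + \sum_{i,j} \alpha_i^k\, (1 - \beta_{ij}^k)\, \log (1 - \pi_i^{k+1}) + \sum_{i,j} 0 \cdot \log(0) + \sum_{i,j} (1 - \alpha_i^k)\, \log(1)$$

$$C = \sum_{i,j} E\big[ \log P(w_{ij} \mid Z_{ij}, \Theta^{k+1}) \,\big|\, P^k \big]$$

$$= \sum_{i,j} P(Z_{ij} = 1 \mid d_i)\, \log P(w_{ij} \mid Z_{ij} = 1, \Theta^{k+1}) + \sum_{i,j} P(Z_{ij} = 0 \mid d_i)\, \log P(w_{ij} \mid Z_{ij} = 0, \Theta^{k+1})$$

$$= \sum_{i,j} \alpha_i^k\, \beta_{ij}^k\, \log p_r^{k+1}(w_{ij}) + \sum_{i,j} (1 - \alpha_i^k\, \beta_{ij}^k)\, \log p_g^{k+1}(w_{ij})$$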

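Now we compute derivatives of $Q(\Theta^{k+1} \mid \Theta^k)$ with respect to $\pi_i^{k+1}$, $p_r^{k+1}(w_l)$, $p_g^{k+1}(w_l)$
and find their values. First, let us find the optimal value of $\pi_i^{k+1}$:

$$\frac{\partial Q}{\partial \pi_i^{k+1}} = \frac{\partial B}{\partial \pi_i^{k+1}} = \frac{\partial}{\partial \pi_i^{k+1}} \Big[ \sum_j \alpha_i^k\, \beta_{ij}^k\, \log \pi_i^{k+1} + \sum_j \alpha_i^k\, (1 - \beta_{ij}^k)\, \log (1 - \pi_i^{k+1}) \Big]$$

$$= \alpha_i^k\, \frac{\sum_j \beta_{ij}^k}{\pi_i^{k+1}} - \alpha_i^k\, \frac{\sum_j (1 - \beta_{ij}^k)}{1 - \pi_i^{k+1}} = 0$$

$$\pi_i^{k+1} = \frac{\sum_j \beta_{ij}^k}{\sum_j \beta_{ij}^k + \sum_j (1 - \beta_{ij}^k)} = \frac{1}{|d_i|} \sum_j \beta_{ij}^k$$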


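Second, let us find the optimal value of $p_r^{k+1}(w_l)$, using a Lagrange multiplier $\lambda$ for the normalization constraint:

$$\frac{\partial Q}{\partial p_r^{k+1}(w_l)} = \frac{\partial C}{\partial p_r^{k+1}(w_l)} = \frac{\partial}{\partial p_r^{k+1}(w_l)} \Big[ \sum_{i,j} \alpha_i^k\, \beta_{ij}^k\, \log p_r^{k+1}(w_{ij}) + \lambda \Big( 1 - \sum_{l'} p_r^{k+1}(w_{l'}) \Big) \Big]$$

$$= \frac{\partial}{\partial p_r^{k+1}(w_l)} \Big[ \sum_{i,j} \delta(w_{ij} = w_l)\, \alpha_i^k\, \beta_{ij}^k\, \log p_r^{k+1}(w_l) + \lambda \Big( 1 - \sum_{l'} p_r^{k+1}(w_{l'}) \Big) \Big]$$

$$= \frac{\sum_{i,j} \delta(w_{ij} = w_l)\, \alpha_i^k\, \beta_{ij}^k}{p_r^{k+1}(w_l)} - \lambda = 0$$

$$p_r^{k+1}(w_l) = \frac{1}{\lambda} \sum_{i,j} \delta(w_{ij} = w_l)\, \alpha_i^k\, \beta_{ij}^k$$

$$1 = \sum_{l=1}^{m} p_r^{k+1}(w_l) = \frac{1}{\lambda} \sum_{l=1}^{m} \sum_{i,j} \delta(w_{ij} = w_l)\, \alpha_i^k\, \beta_{ij}^k = \frac{1}{\lambda} \sum_{i,j} \alpha_i^k\, \beta_{ij}^k$$

$$\lambda = \sum_{i,j} \alpha_i^k\, \beta_{ij}^k$$

$$p_r^{k+1}(w_l) = \frac{\sum_{i,j} \delta(w_{ij} = w_l)\, \alpha_i^k\, \beta_{ij}^k}{\sum_{i,j} \alpha_i^k\, \beta_{ij}^k} = \frac{\sum_i \alpha_i^k \sum_j \delta(w_{ij} = w_l)\, \beta_{ij}^k}{\sum_i \alpha_i^k \sum_j \beta_{ij}^k}$$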


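Finally, let us find the optimal value of $p_g^{k+1}(w_l)$:

$$\frac{\partial Q}{\partial p_g^{k+1}(w_l)} = \frac{\partial C}{\partial p_g^{k+1}(w_l)} = \frac{\partial}{\partial p_g^{k+1}(w_l)} \Big[ \sum_{i,j} (1 - \alpha_i^k\, \beta_{ij}^k)\, \log p_g^{k+1}(w_{ij}) + \lambda \Big( 1 - \sum_{l'} p_g^{k+1}(w_{l'}) \Big) \Big]$$

$$= \frac{\partial}{\partial p_g^{k+1}(w_l)} \Big[ \sum_{i,j} \delta(w_{ij} = w_l)\, (1 - \alpha_i^k\, \beta_{ij}^k)\, \log p_g^{k+1}(w_l) + \lambda \Big( 1 - \sum_{l'} p_g^{k+1}(w_{l'}) \Big) \Big]$$

$$= \frac{\sum_{i,j} \delta(w_{ij} = w_l)\, (1 - \alpha_i^k\, \beta_{ij}^k)}{p_g^{k+1}(w_l)} - \lambda = 0$$

$$p_g^{k+1}(w_l) = \frac{\sum_{i,j} \delta(w_{ij} = w_l)\, (1 - \alpha_i^k\, \beta_{ij}^k)}{\sum_{i,j} (1 - \alpha_i^k\, \beta_{ij}^k)} = \frac{N_w - \sum_i \alpha_i^k \sum_j \delta(w_{ij} = w_l)\, \beta_{ij}^k}{N - \sum_i \alpha_i^k \sum_j \beta_{ij}^k},$$

where $N_w$ is the number of occurrences of $w_l$ in the dataset and $N = \sum_i |d_i|$ is the total number of word occurrences.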


EM  algorithm 

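To compute $\alpha_i^k$ and $\beta_{ij}^k$ efficiently, let us use the following relations:

$$\alpha_i = \frac{p_d \prod_j \big( \pi_i\, p_r(w_{ij}) + (1 - \pi_i)\, p_g(w_{ij}) \big)}{p_d \prod_j \big( \pi_i\, p_r(w_{ij}) + (1 - \pi_i)\, p_g(w_{ij}) \big) + (1 - p_d) \prod_j p_g(w_{ij})}$$

$$= \frac{1}{1 + \frac{1 - p_d}{p_d} \prod_j \frac{p_g(w_{ij})}{\pi_i\, p_r(w_{ij}) + (1 - \pi_i)\, p_g(w_{ij})}} = \frac{1}{1 + \frac{1 - p_d}{p_d} \Big/ \prod_j \Big( \pi_i\, \frac{p_r(w_{ij})}{p_g(w_{ij})} + 1 - \pi_i \Big)}$$

$$= \frac{1}{1 + \frac{1 - p_d}{p_d} \Big/ \prod_j \Big( \frac{\pi_i\, p_r(w_{ij})}{(1 - \pi_i)\, p_g(w_{ij})} + 1 \Big) (1 - \pi_i)}$$

$$\beta_{ij} = \frac{\pi_i\, p_r(w_{ij})}{\pi_i\, p_r(w_{ij}) + (1 - \pi_i)\, p_g(w_{ij})} = \frac{1}{1 + \frac{(1 - \pi_i)\, p_g(w_{ij})}{\pi_i\, p_r(w_{ij})}} = \frac{1}{1 + 1 \Big/ \frac{\pi_i\, p_r(w_{ij})}{(1 - \pi_i)\, p_g(w_{ij})}}$$

Denote

$$\gamma_{ij}^k = \frac{\pi_i^k\, p_r^k(w_{ij})}{(1 - \pi_i^k)\, p_g^k(w_{ij})}.$$

Then we have

$$\alpha_i^k = \frac{1}{1 + \frac{1 - p_d}{p_d} \Big/ \prod_j (\gamma_{ij}^k + 1)(1 - \pi_i^k)} \qquad (B.8)$$

$$\beta_{ij}^k = \frac{1}{1 + \frac{1}{\gamma_{ij}^k}} \qquad (B.9)$$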

Algorithm: 

1.  Initialization: 

(a)  For  each  document  di :  7if 

(b)  For  each  word  tup  p®(w{)  < 


Pv 


score(wi) 


XV  score 


and  AjM 


score(wi ) 


E 


w 


e(u),/) 


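2. For each document $d_i$:

(a) For each word $w_{ij}$ calculate $\gamma_{ij}^k$, and then $\beta_{ij}^k$ using (B.9).

(b) Accumulate $\prod_j (\gamma_{ij}^k + 1)(1 - \pi_i^k)$. Calculate $\alpha_i^k$ using (B.8).

(c) Accumulate $\sum_j \beta_{ij}^k$. Calculate $\pi_i^{k+1} \leftarrow \frac{1}{|d_i|} \sum_j \beta_{ij}^k$.

3. Over all documents, accumulate $\psi^k \leftarrow \sum_i \alpha_i^k \sum_j \beta_{ij}^k$.

4. Rank documents in decreasing order of $\alpha_i^k$. Stop if the ranking has not changed
since the previous iteration.1

5. For each word $w_l$:

(a) Over all documents, accumulate $g^k \leftarrow \sum_i \alpha_i^k \sum_j \delta(w_{ij} = w_l)\, \beta_{ij}^k$.

(b) Calculate $p_r^{k+1}(w_l) \leftarrow \frac{g^k}{\psi^k}$ and $p_g^{k+1}(w_l) \leftarrow \frac{N_w - g^k}{N - \psi^k}$.

6. $k \leftarrow k + 1$. Go to 2.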


1  Alternatively,  the  algorithm  can  be  terminated  after  a  predefined  number  of  EM  iterations. 
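
The iteration can be sketched in Python as follows. This is an illustration under stated assumptions, not the implementation used for the experiments: the function name `em_one_class` is hypothetical, the initial values of $\pi_i$, $p_r$ and $p_g$ are supplied by the caller, documents are assumed non-empty, the product in (B.8) is accumulated in log space, and the small floor `EPS` is a numerical safeguard added here. The formula $\beta = \gamma / (1 + \gamma)$ used in the code is algebraically the same as (B.9).

```python
import math
from collections import Counter

EPS = 1e-12  # numerical floor; a safeguard of this sketch, not part of the derivation


def em_one_class(docs, p_d, pi_init, p_r, p_g, max_iter=100):
    """docs: non-empty lists of word tokens; p_r, p_g: dicts word -> probability
    covering the whole vocabulary. Returns (alpha, pi, p_r, p_g)."""
    pi = [pi_init] * len(docs)
    p_r, p_g = dict(p_r), dict(p_g)
    n_w = Counter(w for doc in docs for w in doc)   # N_w: occurrences of each word
    n_total = sum(n_w.values())                     # N: all word occurrences
    log_prior_odds = math.log((1 - p_d) / p_d)
    alpha = [0.0] * len(docs)
    previous_ranking = None
    for _ in range(max_iter):
        g, psi = Counter(), 0.0
        for i, doc in enumerate(docs):
            # Step 2(a): gamma_ij, then beta_ij = gamma_ij / (1 + gamma_ij).
            gammas = [pi[i] * p_r[w] / ((1 - pi[i]) * p_g[w]) for w in doc]
            betas = [gam / (1 + gam) for gam in gammas]
            # Step 2(b): log of prod_j (gamma_ij + 1)(1 - pi_i), then alpha_i via (B.8).
            log_prod = (sum(math.log1p(gam) for gam in gammas)
                        + len(doc) * math.log(1 - pi[i]))
            x = log_prior_odds - log_prod
            alpha[i] = 0.0 if x > 700 else 1 / (1 + math.exp(x))
            # Step 2(c): pi_i for the next iteration, kept strictly inside (0, 1).
            sum_beta = sum(betas)
            pi[i] = min(max(sum_beta / len(doc), EPS), 1 - EPS)
            # Steps 3 and 5(a): accumulate psi and the per-word sums g.
            psi += alpha[i] * sum_beta
            for w, beta in zip(doc, betas):
                g[w] += alpha[i] * beta
        # Step 4: stop when the ranking of documents by alpha is unchanged.
        ranking = sorted(range(len(docs)), key=lambda i: -alpha[i])
        if ranking == previous_ranking or psi <= 0 or n_total - psi <= 0:
            break
        previous_ranking = ranking
        # Step 5(b): re-estimate the two word distributions.
        for w in n_w:
            p_r[w] = max(g[w] / psi, EPS)
            p_g[w] = max((n_w[w] - g[w]) / (n_total - psi), EPS)
    return alpha, pi, p_r, p_g
```

The returned `alpha` values are the posteriors used to rank documents; the `max_iter` cap corresponds to terminating after a predefined number of EM iterations.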